Data Loading, Inspection and Visualization¶
Mainly using pandas
# Render our plots inline
%matplotlib inline
# Import required packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
We add some code that simply changes the visual style of the output. (The code below is optional, and for now you do not need to understand exactly what it does.)
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
Data Loading¶
One can load a CSV file directly from a URL with pandas, or download it first and then load it from a local directory on the computer. The pd.read_csv method has many options, which you can read more about in the online documentation.
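As a small, self-contained sketch of some of those options (using a made-up in-memory CSV instead of a downloaded file):

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a downloaded file (illustrative data)
csv_text = """CRASH DATE,BOROUGH,NUMBER OF PERSONS INJURED
09/11/2021,,2
03/26/2022,BROOKLYN,1
06/29/2022,QUEENS,0
"""

# A few commonly used read_csv options
df_demo = pd.read_csv(
    io.StringIO(csv_text),              # any file path or URL works here as well
    sep=",",                            # column delimiter (the default)
    nrows=2,                            # read only the first 2 data rows
    usecols=["CRASH DATE", "BOROUGH"],  # load only a subset of the columns
)

print(df_demo.shape)  # (2, 2)
```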
In the following, we will focus on the NYPD Vehicle Collisions data set.
url = 'https://data.cityofnewyork.us/api/views/h9gi-nx95/rows.csv?accessType=DOWNLOAD'
df = pd.read_csv(url, low_memory=False)
# Let's take a look at the first 5 rows of the dataframe
df.head(5)
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 09/11/2021 | 2:39 | NaN | NaN | NaN | NaN | NaN | WHITESTONE EXPRESSWAY | 20 AVENUE | NaN | ... | Unspecified | NaN | NaN | NaN | 4455765 | Sedan | Sedan | NaN | NaN | NaN |
| 1 | 03/26/2022 | 11:45 | NaN | NaN | NaN | NaN | NaN | QUEENSBORO BRIDGE UPPER | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4513547 | Sedan | NaN | NaN | NaN | NaN |
| 2 | 06/29/2022 | 6:55 | NaN | NaN | NaN | NaN | NaN | THROGS NECK BRIDGE | NaN | NaN | ... | Unspecified | NaN | NaN | NaN | 4541903 | Sedan | Pick-up Truck | NaN | NaN | NaN |
| 3 | 09/11/2021 | 9:35 | BROOKLYN | 11208 | 40.667202 | -73.866500 | (40.667202, -73.8665) | NaN | NaN | 1211 LORING AVENUE | ... | NaN | NaN | NaN | NaN | 4456314 | Sedan | NaN | NaN | NaN | NaN |
| 4 | 12/14/2021 | 8:13 | BROOKLYN | 11233 | 40.683304 | -73.917274 | (40.683304, -73.917274) | SARATOGA AVENUE | DECATUR STREET | NaN | ... | NaN | NaN | NaN | NaN | 4486609 | NaN | NaN | NaN | NaN | NaN |
5 rows × 29 columns
Data Inspection/Visualization¶
Using the info() method you can obtain a concise summary of the data, including the data types under which each column has been saved.
We can use the method describe() to get some statistics of the numeric attributes in the DataFrame.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2026647 entries, 0 to 2026646 Data columns (total 29 columns): # Column Dtype --- ------ ----- 0 CRASH DATE object 1 CRASH TIME object 2 BOROUGH object 3 ZIP CODE object 4 LATITUDE float64 5 LONGITUDE float64 6 LOCATION object 7 ON STREET NAME object 8 CROSS STREET NAME object 9 OFF STREET NAME object 10 NUMBER OF PERSONS INJURED float64 11 NUMBER OF PERSONS KILLED float64 12 NUMBER OF PEDESTRIANS INJURED int64 13 NUMBER OF PEDESTRIANS KILLED int64 14 NUMBER OF CYCLIST INJURED int64 15 NUMBER OF CYCLIST KILLED int64 16 NUMBER OF MOTORIST INJURED int64 17 NUMBER OF MOTORIST KILLED int64 18 CONTRIBUTING FACTOR VEHICLE 1 object 19 CONTRIBUTING FACTOR VEHICLE 2 object 20 CONTRIBUTING FACTOR VEHICLE 3 object 21 CONTRIBUTING FACTOR VEHICLE 4 object 22 CONTRIBUTING FACTOR VEHICLE 5 object 23 COLLISION_ID int64 24 VEHICLE TYPE CODE 1 object 25 VEHICLE TYPE CODE 2 object 26 VEHICLE TYPE CODE 3 object 27 VEHICLE TYPE CODE 4 object 28 VEHICLE TYPE CODE 5 object dtypes: float64(4), int64(7), object(18) memory usage: 448.4+ MB
df.describe()
| LATITUDE | LONGITUDE | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED | NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | COLLISION_ID | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.796344e+06 | 1.796344e+06 | 2.026629e+06 | 2.026616e+06 | 2.026647e+06 | 2.026647e+06 | 2.026647e+06 | 2.026647e+06 | 2.026647e+06 | 2.026647e+06 | 2.026647e+06 |
| mean | 4.062776e+01 | -7.375228e+01 | 3.036264e-01 | 1.454148e-03 | 5.528639e-02 | 7.263228e-04 | 2.630749e-02 | 1.115142e-04 | 2.187850e-01 | 5.950716e-04 | 3.122850e+06 |
| std | 1.980800e+00 | 3.726823e+00 | 6.948115e-01 | 4.018541e-02 | 2.415744e-01 | 2.743065e-02 | 1.619910e-01 | 1.060607e-02 | 6.559466e-01 | 2.661214e-02 | 1.504145e+06 |
| min | 0.000000e+00 | -2.013600e+02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.200000e+01 |
| 25% | 4.066792e+01 | -7.397492e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.142782e+06 |
| 50% | 4.072097e+01 | -7.392732e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.649550e+06 |
| 75% | 4.076956e+01 | -7.386668e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.156458e+06 |
| max | 4.334444e+01 | 0.000000e+00 | 4.300000e+01 | 8.000000e+00 | 2.700000e+01 | 6.000000e+00 | 4.000000e+00 | 2.000000e+00 | 4.300000e+01 | 5.000000e+00 | 4.663399e+06 |
The shape property allows you to see how many rows and columns there are.
df.shape
(2026647, 29)
# Number of rows/observations - the data numerosity
df.shape[0]
2026647
# Number of features/column/attributes - the data dimensionality
df.shape[1]
29
We can also list the columns and check the data types for each column using dtypes.
df.columns
Index(['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE',
'LONGITUDE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME',
'OFF STREET NAME', 'NUMBER OF PERSONS INJURED',
'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'],
dtype='object')
df.dtypes
CRASH DATE object CRASH TIME object BOROUGH object ZIP CODE object LATITUDE float64 LONGITUDE float64 LOCATION object ON STREET NAME object CROSS STREET NAME object OFF STREET NAME object NUMBER OF PERSONS INJURED float64 NUMBER OF PERSONS KILLED float64 NUMBER OF PEDESTRIANS INJURED int64 NUMBER OF PEDESTRIANS KILLED int64 NUMBER OF CYCLIST INJURED int64 NUMBER OF CYCLIST KILLED int64 NUMBER OF MOTORIST INJURED int64 NUMBER OF MOTORIST KILLED int64 CONTRIBUTING FACTOR VEHICLE 1 object CONTRIBUTING FACTOR VEHICLE 2 object CONTRIBUTING FACTOR VEHICLE 3 object CONTRIBUTING FACTOR VEHICLE 4 object CONTRIBUTING FACTOR VEHICLE 5 object COLLISION_ID int64 VEHICLE TYPE CODE 1 object VEHICLE TYPE CODE 2 object VEHICLE TYPE CODE 3 object VEHICLE TYPE CODE 4 object VEHICLE TYPE CODE 5 object dtype: object
Columns of type object are stored as strings. For some of them, we would like to convert to a more suitable data type, for example with the pd.to_datetime function. To do so, we first need to understand how dates are parsed using the Python format conventions.
The relevant entries from the table are:
- %m: Month as a zero-padded decimal number.
- %d: Day of the month as a zero-padded decimal number.
- %Y: Year with century as a decimal number.
Now, we can specify how to parse the dates.
df["CRASH DATE"] = pd.to_datetime(df["CRASH DATE"], format="%m/%d/%Y")
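To see what the format string does, here is a minimal sketch on a few hypothetical date strings in the same MM/DD/YYYY layout:

```python
import pandas as pd

# Made-up dates in the same layout as the CRASH DATE column
dates = pd.Series(["09/11/2021", "03/26/2022", "06/29/2022"])

# %m/%d/%Y matches e.g. "09/11/2021"
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

print(parsed.dt.year.tolist())   # [2021, 2022, 2022]
print(parsed.dt.month.tolist())  # [9, 3, 6]
```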
Selecting columns¶
cols = ["CRASH DATE", "BOROUGH", "NUMBER OF PERSONS INJURED"]
df[cols]
# df[["CRASH DATE", "BOROUGH", "NUMBER OF PERSONS INJURED"]]
| CRASH DATE | BOROUGH | NUMBER OF PERSONS INJURED | |
|---|---|---|---|
| 0 | 2021-09-11 | NaN | 2.0 |
| 1 | 2022-03-26 | NaN | 1.0 |
| 2 | 2022-06-29 | NaN | 0.0 |
| 3 | 2021-09-11 | BROOKLYN | 0.0 |
| 4 | 2021-12-14 | BROOKLYN | 0.0 |
| ... | ... | ... | ... |
| 2026642 | 2023-07-03 | NaN | 0.0 |
| 2026643 | 2023-07-22 | BRONX | 1.0 |
| 2026644 | 2023-07-02 | MANHATTAN | 0.0 |
| 2026645 | 2023-07-22 | QUEENS | 1.0 |
| 2026646 | 2023-07-22 | QUEENS | 0.0 |
2026647 rows × 3 columns
Selecting rows¶
df[0:5]
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-09-11 | 2:39 | NaN | NaN | NaN | NaN | NaN | WHITESTONE EXPRESSWAY | 20 AVENUE | NaN | ... | Unspecified | NaN | NaN | NaN | 4455765 | Sedan | Sedan | NaN | NaN | NaN |
| 1 | 2022-03-26 | 11:45 | NaN | NaN | NaN | NaN | NaN | QUEENSBORO BRIDGE UPPER | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4513547 | Sedan | NaN | NaN | NaN | NaN |
| 2 | 2022-06-29 | 6:55 | NaN | NaN | NaN | NaN | NaN | THROGS NECK BRIDGE | NaN | NaN | ... | Unspecified | NaN | NaN | NaN | 4541903 | Sedan | Pick-up Truck | NaN | NaN | NaN |
| 3 | 2021-09-11 | 9:35 | BROOKLYN | 11208 | 40.667202 | -73.866500 | (40.667202, -73.8665) | NaN | NaN | 1211 LORING AVENUE | ... | NaN | NaN | NaN | NaN | 4456314 | Sedan | NaN | NaN | NaN | NaN |
| 4 | 2021-12-14 | 8:13 | BROOKLYN | 11233 | 40.683304 | -73.917274 | (40.683304, -73.917274) | SARATOGA AVENUE | DECATUR STREET | NaN | ... | NaN | NaN | NaN | NaN | 4486609 | NaN | NaN | NaN | NaN | NaN |
5 rows × 29 columns
Selecting both rows and columns by name (df.loc) or by position (df.iloc)¶
df.loc[0:5, ["CRASH DATE", "BOROUGH", "NUMBER OF PERSONS INJURED"]]
| CRASH DATE | BOROUGH | NUMBER OF PERSONS INJURED | |
|---|---|---|---|
| 0 | 2021-09-11 | NaN | 2.0 |
| 1 | 2022-03-26 | NaN | 1.0 |
| 2 | 2022-06-29 | NaN | 0.0 |
| 3 | 2021-09-11 | BROOKLYN | 0.0 |
| 4 | 2021-12-14 | BROOKLYN | 0.0 |
| 5 | 2021-04-14 | NaN | 0.0 |
df.iloc[[1,4], 0:3]
| CRASH DATE | CRASH TIME | BOROUGH | |
|---|---|---|---|
| 1 | 2022-03-26 | 11:45 | NaN |
| 4 | 2021-12-14 | 8:13 | BROOKLYN |
# You can also change the value of an observation directly in the data frame
# df.loc[0, "BOROUGH"] = "BROOKLYN"
Boolean Indexing¶
Boolean indexing filters rows that satisfy a condition. Below is an example that selects the data in a certain area, specified by latitude and longitude ranges.
boolean_condition = (df.LONGITUDE<-50) & (df.LONGITUDE>-74.5) & (df.LATITUDE< 41)
df_filtered = df[boolean_condition]
# same as:
# df_filtered = df[(df.LONGITUDE<-50) & (df.LONGITUDE>-74.5) & (df.LATITUDE< 41)]
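The mechanics of the boolean mask can be seen on a tiny made-up frame:

```python
import pandas as pd

# A tiny illustrative frame with made-up coordinates
toy = pd.DataFrame({
    "LATITUDE":  [40.7, 0.0, 40.9, 43.3],
    "LONGITUDE": [-73.9, 0.0, -74.0, -70.1],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than < and >.
mask = (toy.LONGITUDE < -50) & (toy.LONGITUDE > -74.5) & (toy.LATITUDE < 41)

print(mask.tolist())   # [True, False, True, False]
print(len(toy[mask]))  # 2
```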
Aggregation¶
Below you find some aggregation examples.
# Sum up the number of all injured persons per borough for all different boroughs
df.groupby("BOROUGH", as_index=False)["NUMBER OF PERSONS INJURED"].sum()
| BOROUGH | NUMBER OF PERSONS INJURED | |
|---|---|---|
| 0 | BRONX | 66156.0 |
| 1 | BROOKLYN | 144095.0 |
| 2 | MANHATTAN | 64655.0 |
| 3 | QUEENS | 110271.0 |
| 4 | STATEN ISLAND | 16405.0 |
# Apply multiple aggregation functions (here "sum" and "max")
df.groupby("BOROUGH", as_index=False)["NUMBER OF PERSONS INJURED"].agg({"SUM INJURED": "sum", "MAX INJURED": "max"})
| BOROUGH | SUM INJURED | MAX INJURED | |
|---|---|---|---|
| 0 | BRONX | 66156.0 | 31.0 |
| 1 | BROOKLYN | 144095.0 | 43.0 |
| 2 | MANHATTAN | 64655.0 | 27.0 |
| 3 | QUEENS | 110271.0 | 34.0 |
| 4 | STATEN ISLAND | 16405.0 | 22.0 |
# Apply an aggregating function to multiple variables, grouped by multiple dimensions
# (note: "sum" on a string column such as the contributing factor concatenates the strings)
df.groupby(["BOROUGH","VEHICLE TYPE CODE 1"])[["NUMBER OF PERSONS INJURED","CONTRIBUTING FACTOR VEHICLE 1"]].sum()
| NUMBER OF PERSONS INJURED | CONTRIBUTING FACTOR VEHICLE 1 | ||
|---|---|---|---|
| BOROUGH | VEHICLE TYPE CODE 1 | ||
| BRONX | 197209 | 0.0 | Unspecified |
| 1S | 0.0 | Passing or Lane Usage Improper | |
| 2 DOO | 0.0 | Unspecified | |
| 2 dr sedan | 49.0 | UnspecifiedUnspecifiedOther VehicularUnspecifi... | |
| 3-Door | 12.0 | Driver Inattention/DistractionPassing Too Clos... | |
| ... | ... | ... | ... |
| STATEN ISLAND | trailer | 1.0 | Driver InexperienceBacking Unsafely |
| unkno | 0.0 | Turning Improperly | |
| usps | 0.0 | Failure to Keep Right | |
| van | 1.0 | Turning ImproperlyFailure to Yield Right-of-Wa... | |
| van t | 1.0 | Unsafe Speed |
2254 rows × 2 columns
Histograms¶
One can examine the distribution of values by using the hist command of Pandas, which creates a histogram. (The histogram is also available as plot.hist(), or plot(kind='hist')).
df["NUMBER OF PERSONS INJURED"].hist()
# df_filtered["NUMBER OF PERSONS INJURED"].plot(kind='hist')
<Axes: >
By default, the histogram has 10 bars. We can change the resolution of the histogram using the bins parameter: a larger number of bins gives a higher resolution.
df["NUMBER OF PERSONS INJURED"].hist(bins=50)
<Axes: >
# A quick exposure to various options of the "hist" command
df["NUMBER OF PERSONS INJURED"].hist(
bins=20, # use 20 bars
range=(0,10), # x-axis from 0 to 10
density=False, # show raw counts (False) or a normalized density (True)
figsize=(15,5), # controls the size of the plot
alpha=0.8, # make the plot 20% transparent
color='green' # change color
)
<Axes: >
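Under the hood, a histogram simply counts values per bin. A quick sketch with np.histogram (on made-up injury counts) shows how the number of bins changes the counts:

```python
import numpy as np

# Made-up injury counts
values = np.array([0, 0, 0, 1, 1, 2, 3, 5])

counts_coarse, _ = np.histogram(values, bins=2)  # 2 wide bins over [0, 5]
counts_fine, _ = np.histogram(values, bins=5)    # 5 narrower bins, higher resolution

print(counts_coarse.tolist())  # [6, 2]
print(counts_fine.tolist())    # [3, 2, 1, 1, 1]
```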
Kernel Density Estimation (KDE)¶
An alternative to histograms is the kernel density estimate, which estimates a continuous density function instead of the bucketized counts, which tend to be discontinuous and bumpy. We can access it using the .plot(kind='kde') command.
Let's see an example.
df["NUMBER OF PERSONS INJURED"].plot(
kind='kde',
color='Black',
xlim=(0,5),
figsize=(15,5)
)
<Axes: ylabel='Density'>
Analyzing the content of categorical columns¶
We can also get quick statistics about the common values that appear in each column:
df["BOROUGH"].value_counts()
BOROUGH BROOKLYN 443172 QUEENS 373975 MANHATTAN 314361 BRONX 206178 STATEN ISLAND 58520 Name: count, dtype: int64
And we can use the "plot" command to plot the resulting bar plot showing the counts (more detail at http://pandas.pydata.org/pandas-docs/stable/visualization.html).
df["BOROUGH"].value_counts().plot(kind='bar')
<Axes: xlabel='BOROUGH'>
# Horizontal bars (dot notation is another way to access a column in pandas when the column name contains no spaces)
df_filtered.BOROUGH.value_counts().plot(kind='barh')
<Axes: ylabel='BOROUGH'>
Pivot Tables¶
Pivot tables are among the most commonly used exploratory tools, and in pandas they are extremely flexible.
Let's use them to break down the accidents by borough and contributing factor.
pivot = pd.pivot_table(
data = df,
index = 'CONTRIBUTING FACTOR VEHICLE 1',
columns = 'BOROUGH',
aggfunc = 'count',
values = 'COLLISION_ID'
)
pivot.head(10)
| BOROUGH | BRONX | BROOKLYN | MANHATTAN | QUEENS | STATEN ISLAND |
|---|---|---|---|---|---|
| CONTRIBUTING FACTOR VEHICLE 1 | |||||
| 1 | NaN | 3.0 | 1.0 | 2.0 | 2.0 |
| 80 | 12.0 | 19.0 | 9.0 | 29.0 | 1.0 |
| Accelerator Defective | 133.0 | 247.0 | 121.0 | 217.0 | 50.0 |
| Aggressive Driving/Road Rage | 1271.0 | 1949.0 | 1344.0 | 1396.0 | 229.0 |
| Alcohol Involvement | 2533.0 | 4737.0 | 2278.0 | 4738.0 | 846.0 |
| Animals Action | 128.0 | 216.0 | 87.0 | 279.0 | 223.0 |
| Backing Unsafely | 8822.0 | 17736.0 | 11500.0 | 18473.0 | 2524.0 |
| Brakes Defective | 782.0 | 1407.0 | 716.0 | 1128.0 | 299.0 |
| Cell Phone (hand-Held) | 72.0 | 97.0 | 53.0 | 72.0 | 18.0 |
| Cell Phone (hand-held) | 8.0 | 21.0 | 15.0 | 11.0 | 1.0 |
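The pivot_table mechanics are easier to see on a tiny made-up frame (hypothetical boroughs and factors):

```python
import pandas as pd

# Made-up collision records
toy = pd.DataFrame({
    "BOROUGH":      ["BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS"],
    "FACTOR":       ["Alcohol", "Backing", "Alcohol", "Alcohol", "Backing"],
    "COLLISION_ID": [1, 2, 3, 4, 5],
})

pivot_demo = pd.pivot_table(
    data=toy,
    index="FACTOR",        # one row per contributing factor
    columns="BOROUGH",     # one column per borough
    values="COLLISION_ID",
    aggfunc="count",       # count the collisions in each cell
)

print(pivot_demo)
```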
Examples¶
Example 1: Find the dates with most accidents.
df["CRASH DATE"].value_counts().head(10)
CRASH DATE 2014-01-21 1161 2018-11-15 1065 2017-12-15 999 2017-05-19 974 2015-01-18 961 2014-02-03 960 2015-03-06 939 2017-05-18 911 2017-01-07 896 2018-03-02 884 Name: count, dtype: int64
Example 2: Find out the 10 most common contributing factors to the collisions.
df_filtered['CONTRIBUTING FACTOR VEHICLE 1'].value_counts().head(11)
CONTRIBUTING FACTOR VEHICLE 1 Unspecified 610815 Driver Inattention/Distraction 362502 Failure to Yield Right-of-Way 108874 Following Too Closely 92487 Backing Unsafely 68667 Other Vehicular 55679 Passing or Lane Usage Improper 50183 Passing Too Closely 46874 Turning Improperly 43240 Fatigued/Drowsy 37696 Unsafe Lane Changing 34075 Name: count, dtype: int64
Now let's plot the above counts as a bar chart. Note that we skip the first element ("Unspecified").
df_filtered['CONTRIBUTING FACTOR VEHICLE 1'].value_counts()[1:11].plot(kind='barh')
<Axes: ylabel='CONTRIBUTING FACTOR VEHICLE 1'>
Example 3: Find out how many collisions had 0 persons injured, 1 person injured, and so on.
The .plot(logy=True) option is used to make the y-axis logarithmic.
plot = (
df['NUMBER OF PERSONS INJURED'] # take the num of injuries column
.value_counts() # compute the frequency of each value
.sort_index() # sort the results based on the index value instead of the frequency,
# which is the default for value_counts
.plot( # and plot the results
kind='line', # we use a line plot because the x-axis is numeric/continuous
marker='o', # we use a marker to mark where we have data points
logy=True # make the y-axis logarithmic
)
)
plot.set_xlabel("Number of injuries")
plot.set_ylabel("Number of collisions");
Example 4: Plot the number of accidents per day.
Ensure that your date column has the right data type and is properly sorted before plotting. The resample command changes the frequency, e.g. from one day to one month. The drop command deletes rows or columns.
# Date converted to proper date format
df["CRASH DATE"] = pd.to_datetime(df["CRASH DATE"], format="%m/%d/%Y")
(
df["CRASH DATE"].value_counts() # count the number of accidents per day
.sort_index() # sort the dates
.resample('1M') # take periods of 1 month
.sum() # sum the number of accidents per month
.iloc[:-1] # drop the last, incomplete month
.plot() # plot the result
)
<Axes: xlabel='CRASH DATE'>
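The resample behavior can be checked on a small synthetic series. (Here we use the "MS" month-start alias, which gives the same monthly sums as "1M", only with different bin labels, and avoids the deprecation of "M" in newer pandas.)

```python
import pandas as pd

# One synthetic "accident" per day for January and February 2023
daily = pd.Series(1, index=pd.date_range("2023-01-01", "2023-02-28", freq="D"))

# Group the daily values into monthly bins and sum them
monthly = daily.resample("MS").sum()

print(monthly.tolist())  # [31, 28]
```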
(Optional) Example 5: Plot the accidents on a map. To do this, we create a scatter plot with the plot(kind='scatter', x=..., y=...) command, using the LONGITUDE and LATITUDE columns.
# We filter the data by specifying a selection condition that limits the lat/long values
# to the NYC region. Values outside this range are probably erroneous inputs.
cleandf = df[(df.LONGITUDE<-50) & (df.LONGITUDE>-74.5) & (df.LATITUDE< 41)]
cleandf[(cleandf.LATITUDE > 40) & (cleandf.LATITUDE < 41) & (cleandf.LONGITUDE > -74.6) & (cleandf.LONGITUDE < -50)].plot(
figsize = (20,15),
kind = 'scatter',
x = 'LONGITUDE',
y = 'LATITUDE',
s = 1, # make each dot very small
alpha = 0.05 # makes each point 95% transparent
)
<Axes: xlabel='LONGITUDE', ylabel='LATITUDE'>
(Optional) 2D histograms, density plots, and contour plots¶
In the picture above, we can visually see that Manhattan, especially eastern midtown, and the area downtown near the entrances to the bridges, have a higher density. We can also compute histograms and density plots in two dimensions.
Hexagonal bin plot¶
The hexbin plot creates a 2D histogram, where the color signals the number of points within a particular area. The gridsize parameter sets the number of hexagons in the x direction. Higher values offer higher granularity, but very high values tend to create sparsity when there are not enough data points.
# Hexbin plot
cleandf.plot(
kind='hexbin',
x='LONGITUDE',
y='LATITUDE',
gridsize=100,
cmap=plt.cm.Blues,
figsize=(10, 7))
<Axes: xlabel='LONGITUDE', ylabel='LATITUDE'>
2d density and contour plots¶
An alternative to the hexbin plots is to use density plots in two dimensions.
# Basic 2D density plot
plt.subplots(figsize=(20, 15))
# We take a sample, because density plots take a long time to compute
# and a sample is typically as good as the full dataset
sample = cleandf.sample(10000)
sns.kdeplot(
x=sample.LONGITUDE,
y=sample.LATITUDE,
gridsize=100, # controls the resolution
cmap=plt.cm.rainbow, # color scheme
fill=True, # filled density plot (True), or just the contours (False)
alpha=0.5,
levels=50 # how many contours/levels to have
) # (fill/levels replace the older shade/n_levels arguments removed in recent seaborn)
<Axes: xlabel='LONGITUDE', ylabel='LATITUDE'>
# Basic 2D contour plot
plt.subplots(figsize=(20, 15))
# We take a sample, because density plots take a long time to compute
# and a sample is typically as good as the full dataset
sample = cleandf.sample(10000)
sns.kdeplot(
x=sample.LONGITUDE,
y=sample.LATITUDE,
gridsize=100,
cmap=plt.cm.rainbow,
fill=False, # contours only
levels=25)
<Axes: xlabel='LONGITUDE', ylabel='LATITUDE'>
Combining Plots¶
So far, we have examined how to create individual plots. We can also combine multiple plots, using the ax parameter. Let's say we want to combine the scatter plot with the contour plot above:
sample = cleandf.sample(10000)
scatterplot = cleandf.plot(
kind='scatter',
x='LONGITUDE',
y='LATITUDE',
figsize=(20, 15),
s=0.5,
alpha=0.1)
sns.kdeplot(
x=sample.LONGITUDE,
y=sample.LATITUDE,
gridsize=100,
cmap=plt.cm.rainbow,
fill=False,
levels=20,
alpha=1,
ax=scatterplot)
<Axes: xlabel='LONGITUDE', ylabel='LATITUDE'>
kNN and Performance Measures¶
kNN Recap¶
The k-Nearest Neighbors (kNN) algorithm assumes that similar things are close to each other. Under this assumption, to classify a point we measure the distance (e.g. the Euclidean distance) to the k nearest instances of the training set and let them vote. k is typically chosen to be an odd number so that the vote cannot be tied.
The KNN algorithm is very useful when there are non-linear decision boundaries. For example, consider the image below, displaying whether there is vegetation depending on latitude and longitude. A logistic regression would split our plane into two and thus would not be able to correctly predict that vegetation data points are located in the top right and bottom left quadrants. However, KNN classifiers would perform much better since vegetation (and non-vegetation) data points are grouped in clusters.
Note that the algorithm can be used for both classification and regression. You can read The Basics: KNN for classification and regression for intuition on how KNN can be applied for regression.
Distance metric ¶
As mentioned above, the KNN algorithm relies on the notion of distance between observations. Which distance? There are several possibilities, the most popular one being the Euclidean distance:
- Euclidean distance, also known as the L2 norm. In a plane, it is the length of the straight line between two points. Imagine we have $d$ (real-valued) features and we wish to calculate the distance between two observations $\boldsymbol{x_{1*}}=(x_{11}, ..., x_{1d})$ and $\boldsymbol{x_{2*}}=(x_{21}, ..., x_{2d})$; the Euclidean distance is:

$$d_2(\boldsymbol{x_{1*}}, \boldsymbol{x_{2*}}) = \sqrt{\sum_{j=1}^{d} (x_{1j}-x_{2j})^2}$$
The Euclidean distance is useful in low dimensions, but it does not work well in high dimensions or for categorical variables. It also ignores similarity between features, since each feature is treated as entirely distinct from all the others.
- Manhattan distance, also known as the L1 norm or "Taxicab" distance. The idea is to travel the space the same way taxis would navigate the streets of a city like the island of Manhattan, known for its grid plan:

$$d_1(\boldsymbol{x_{1*}}, \boldsymbol{x_{2*}}) = \sum_{j=1}^{d} |x_{1j}-x_{2j}|$$
Manhattan distance is favored over Euclidean distance when we have many features (see for instance, Aggarwal, Hinneburg, & Keim paper On the Surprising Behavior of Distance Metrics in High Dimensional Space).
- Minkowski distance, or $L_p$ distance, generalizes the Euclidean and Manhattan distances:

$$d_p(\boldsymbol{x_{1*}}, \boldsymbol{x_{2*}}) = \left(\sum_{j=1}^{d} |x_{1j}-x_{2j}|^p\right)^{1/p}$$
For $p=1$, we get the Manhattan distance. For $p=2$, we get the Euclidean distance. As $p$ tends to $\infty$, we obtain $d_\infty = \max_j |x_{1j}-x_{2j}|$ (the Chebyshev distance).
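The three special cases can be checked numerically. A small sketch with two made-up observations:

```python
import numpy as np

# Two made-up observations with d = 3 features
x1 = np.array([1.0, 4.0, 0.0])
x2 = np.array([2.0, 2.0, 3.0])

diff = np.abs(x1 - x2)  # |x_1j - x_2j| = [1, 2, 3]

manhattan = diff.sum()                  # p = 1         -> 6.0
euclidean = np.sqrt((diff ** 2).sum())  # p = 2         -> sqrt(14) ~ 3.74
chebyshev = diff.max()                  # p -> infinity -> 3.0

print(manhattan, euclidean, chebyshev)
```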
Example¶
We illustrate kNN with a simple synthetic data set.
# Import
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Customize plots
%matplotlib inline
sns.set_theme(style="white")
plt.style.use('grayscale')
# Supress warnings
import warnings
warnings.filterwarnings('ignore')
The code below generates 16 points in the unit square $[0,1]^2$. Points with low values of x1 and x2 belong to class 0, and points with high values of x1 and x2 belong to class 1.
# Create Data
data = {"x1":[0, 0.4, 0.15, 0.05, 0.4, 0.20, 0, 0.45, 1, 0.85, 0.9, 0.7, 0.65, 0.95, 1, 0.8],
"x2":[0.2, 0.35, 0, 0.10, 0.4, 0.25, 0.40, 0.35, 0.85, 0.95, 1, 0.65, 0.75, 0.9, 0.9, 0.95],
"y":[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]}
data = pd.DataFrame(data)
data
| x1 | x2 | y | |
|---|---|---|---|
| 0 | 0.00 | 0.20 | 0 |
| 1 | 0.40 | 0.35 | 0 |
| 2 | 0.15 | 0.00 | 0 |
| 3 | 0.05 | 0.10 | 0 |
| 4 | 0.40 | 0.40 | 0 |
| 5 | 0.20 | 0.25 | 0 |
| 6 | 0.00 | 0.40 | 0 |
| 7 | 0.45 | 0.35 | 0 |
| 8 | 1.00 | 0.85 | 1 |
| 9 | 0.85 | 0.95 | 1 |
| 10 | 0.90 | 1.00 | 1 |
| 11 | 0.70 | 0.65 | 1 |
| 12 | 0.65 | 0.75 | 1 |
| 13 | 0.95 | 0.90 | 1 |
| 14 | 1.00 | 0.90 | 1 |
| 15 | 0.80 | 0.95 | 1 |
We also have 3 new points for which we do not know the class.
We want to build a model to determine which class (0 or 1) these 3 points belong to.
# New points
p = pd.DataFrame({"name":["p1", "p2", "p3"], "x1":[0.15, 0.75, 0.5],
"x2":[0.35, 0.8, 0.6]})
p
| name | x1 | x2 | |
|---|---|---|---|
| 0 | p1 | 0.15 | 0.35 |
| 1 | p2 | 0.75 | 0.80 |
| 2 | p3 | 0.50 | 0.60 |
First we plot our dataset with the x1 values on the horizontal axis and the x2 values on the vertical axis. We color the points according to the target variable y, which only takes the values 0 (red) and 1 (blue).
The new points are marked by an orange x marker.
# Plot
data.plot.scatter("x1", "x2", c="y", colormap="coolwarm_r", figsize=(7, 5))
plt.scatter(p.x1, p.x2, c="orange", marker="x")
for point in p.values:
plt.text(point[1]+0.01, point[2], point[0])
The two classes can be identified on the above scatter plot. In addition, p1 seems to belong to class 0, p2 to class 1. The class assignment for p3 is not so clear.
Below we classify the new points using the kNN algorithm for different values of k (i.e. the number of neighbors considered in the class vote).
# Select X and y
X = data[["x1", "x2"]]
y = data["y"]
First, we build a simple model using the sklearn KNeighborsClassifier.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X, y)
y_pred = knn.predict(X)
y_pred
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
We can specify various parameters for the kNN method KNeighborsClassifier():
- n_neighbors=: the number of neighboring observations to use.
- p=: determines the distance/similarity metric ("p" refers to the Minkowski distance). When p = 1, the Manhattan distance (l1-norm) is used; when p = 2 (the default), the Euclidean distance (l2-norm) is used.
- weights=: determines how to weigh the neighboring observations. When set to uniform (the default), all points in each neighborhood are weighted equally. When set to distance, points are weighted by the inverse of their distance, so closer neighbors of a query point have greater influence than neighbors which are further away.
Please refer to the documentation for the full list of parameters and their meaning.
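For instance, the effect of the weights parameter can be seen on a tiny made-up 1-d dataset, where the query point has one very close class-1 neighbor and two farther class-0 neighbors:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-d training data: one close class-1 point, two farther class-0 points
X = np.array([[0.0], [2.0], [2.5]])
y = np.array([1, 0, 0])

query = [[0.5]]  # the point we want to classify

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

# Uniform: plain majority vote over the 3 neighbors -> class 0
# Distance: the class-1 point at distance 0.5 gets weight 2, outweighing
# the two class-0 points (weights 2/3 and 1/2) -> class 1
print(uniform.predict(query), weighted.predict(query))
```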
knn
KNeighborsClassifier()
But let's now see what class kNN would predict for our 3 points with unknown labels for different values of k.
# KNN plot
fig, ax = plt.subplots(1, 3, figsize=(18,5))
i = 0
for k in [1, 3, 5]:
model = KNeighborsClassifier(n_neighbors=k).fit(X,y)
pred = model.predict(p[["x1", "x2"]])
ax[i].scatter(data.x1, data.x2, c=data.y, cmap="coolwarm_r")
ax[i].scatter(p.x1, p.x2, c=pred, cmap="coolwarm_r", marker="x")
ax[i].set_title("KNN with k = " + str(k))
i += 1
For k = 1 and k = 3, p3 belongs to class 0 while it belongs to class 1 for k = 5.
Exercise: Diabetes Classification¶
We will work with the diabetes dataset, which contains patient attributes (e.g. age, glucose, ...) and information on whether the patient was diagnosed with diabetes (0 meaning "no", 1 meaning "yes"). The goal is to learn a model that predicts whether a (new) patient has diabetes based on their individual characteristics (the set of patient attributes). This is a classification task, and you can use the kNN classifier.
Data Scaling ¶
The kNN approach (like many other ML approaches) is sensitive to the ranges of the input features. When a dataset has features with very different ranges, the results can be biased towards the features with large values. We want the features to be on the same or a similar scale.
We therefore need to scale the data: transform all values of each attribute so that they fall within a small, specified range. You can use the StandardScaler() (Documentation), the MinMaxScaler() (Documentation), or others for normalization.
In this example, we scale both the train AND test data using the MinMaxScaler().
IMPORTANT: When you scale the train data, you need to do the same modification to the test data. In other words, you train your scaler on your training set, and apply the same transformation to the training and test set.
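A minimal sketch of this train/test scaling discipline with made-up numbers (the scaler learns min and max from the training data only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up train/test values of a single feature
train = np.array([[0.0], [50.0], [100.0]])
test = np.array([[50.0], [200.0]])

scaler = MinMaxScaler().fit(train)  # learn min/max from the TRAINING data only

print(scaler.transform(train).ravel())  # [0.  0.5 1. ]
print(scaler.transform(test).ravel())   # [0.5 2. ] -- test values can fall outside [0, 1]
```

Refitting the scaler on the test set would silently map the test data to a different scale than the one the model was trained on, which is why the same fitted scaler is applied to both sets.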
1. Model fitting and performance evaluation¶
First, split the data into a test set (20%) and a training set (80%) using train_test_split (Documentation) from sklearn, already imported for you. Then perform classification using a kNN classifier with k = 5.
Calculate accuracy, recall, precision and f1-score for your classifier and plot the confusion matrix
to analyze the performance of the model.
# Import additional libraries
# data splitting
from sklearn.model_selection import train_test_split
# data scaling
from sklearn.preprocessing import MinMaxScaler
# performance measures
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import confusion_matrix,f1_score,precision_score,recall_score,accuracy_score,average_precision_score
# classifier
from sklearn.neighbors import KNeighborsClassifier
# set a random seed
np.random.seed(1)
df = pd.read_csv('DiabetesDataset.csv')
# Take a quick look at the data
print(df)
# keep the patient characteristics as inputs x and the diabetes as target y
x = df.drop(columns=['Diabetes'])
y = df['Diabetes'].values
labels = ["No Diabetes", "Diabetes"]
#### START YOUR SOLUTION HERE ####
# Split data into training and test dataset
trainX, testX, trainy, testy = train_test_split(x, y, test_size=0.2)
# Define the data scaler
scaler = MinMaxScaler()
# Fit and transform the training set
trainX = scaler.fit_transform(trainX)
# Transform the test set
testX = scaler.transform(testX)
# Fit kNN model with k=5 to the training data
model = KNeighborsClassifier(n_neighbors=5).fit(trainX, trainy)
# Get predictions on the test set
pred = model.predict(testX)
# Compute the performance measures listed in the text above
conf = confusion_matrix(testy, pred)
acc = accuracy_score(testy, pred)
rec = recall_score(testy, pred)
prec = precision_score(testy, pred)
f1 = f1_score(testy, pred)
# Print the values of all performance measures except the confusion matrix
print( "Performance measurements", "\n",
"accuracy : ", round(acc,3),"\n",
"recall : ", round(rec,3), "\n",
"precision : ", round(prec,3),"\n",
"f1-score : ", round(f1,3))
# Display confusion matrix using a heatmap
sns.heatmap(conf,
annot=True,
fmt='d',
cbar=False,
cmap="coolwarm_r",
xticklabels=labels,
yticklabels=labels,
linewidth = 1)
plt.title("Confusion Matrix")
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
#### END YOUR SOLUTION HERE ####
Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
.. ... ... ... ... ... ...
763 10 101 76 48 180 32.9
764 2 122 70 27 0 36.8
765 5 121 72 23 112 26.2
766 1 126 60 0 0 30.1
767 1 93 70 31 0 30.4
DiabetesPedigreeFunction Age Diabetes
0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
.. ... ... ...
763 0.171 63 0
764 0.340 27 0
765 0.245 30 0
766 0.349 47 1
767 0.315 23 0
[768 rows x 9 columns]
Performance measurements
accuracy : 0.818
recall : 0.673
precision : 0.787
f1-score : 0.725
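As a sanity check, each of these scores can be recovered directly from the four entries of the confusion matrix. The sketch below uses a hypothetical matrix (not the one produced above), following sklearn's [[TN, FP], [FN, TP]] layout:

```python
import numpy as np

# Hypothetical confusion matrix in sklearn's [[TN, FP], [FN, TP]] layout
conf = np.array([[90, 10],
                 [20, 30]])
tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()                   # correct predictions / all predictions
recall = tp / (tp + fn)                             # true positive rate (sensitivity)
precision = tp / (tp + fp)                          # fraction of predicted positives that are correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, recall, precision, f1)              # 0.8 0.6 0.75 0.666...
```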
2. Performance Curves¶
Compute the values necessary for plotting the ROC and Precision-Recall curves, then plot and inspect the curves and compute the area under each.
# Predict probabilities for the test set
probs = model.predict_proba(testX)
# Keep the Probabilities of the positive class only
probs = probs[:, 1]
# Function for plotting the ROC curve
def plot_roc_curve(fpr, tpr):
plt.plot(fpr, tpr, color='orange', label='ROC')
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--', label = 'random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()
# Function for plotting the Precision-Recall curve
def plot_rpc(recall, precision):
plt.plot(recall, precision, color='orange', label='RPC')
plt.ylabel('Precision')
plt.xlabel('Recall = True Positive Rate')
plt.title('Recall-Precision Curve')
plt.legend()
plt.show()
#### START YOUR SOLUTION HERE ####
# Plot ROC curve (check out the function roc_curve)
fpr, tpr, thresholds = roc_curve(testy, probs)
plot_roc_curve(fpr, tpr)
# Compute the Area Under the ROC Curve (AUC) - the ROC AUC score
auc = roc_auc_score(testy, probs)
print("AUC: " , round(auc, 3))
# Plot Precision-Recall curve
precision, recall, _ = precision_recall_curve(testy, probs)
plot_rpc(recall, precision)
# Compute average precision - Precision-Recall AUC
average_precision = average_precision_score(testy, probs)
print("Average Precision: ", round(average_precision, 3))
#### END YOUR SOLUTION HERE ####
AUC: 0.838
Average Precision: 0.723
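Each point on the ROC curve corresponds to one decision threshold applied to the predicted probabilities. A minimal sketch with synthetic labels and scores (hypothetical values, not the model's output above) shows how sweeping the threshold traces out the curve:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])               # synthetic labels
scores = np.array([0.1, 0.4, 0.35, 0.8])      # synthetic predicted probabilities

def tpr_fpr(threshold):
    # Classify as positive when the score reaches the threshold
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    return tp / np.sum(y_true == 1), fp / np.sum(y_true == 0)

# Lowering the threshold moves us from (FPR=0, TPR=0) towards (1, 1)
for t in [0.9, 0.5, 0.3, 0.0]:
    tpr, fpr = tpr_fpr(t)
    print(f"threshold={t}: TPR={tpr}, FPR={fpr}")
```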
3. Assignment: Different Values of k¶
Now fit two additional k-NN classifiers to the same dataset with k values 1 and n (the number of training samples), respectively. Compute the accuracies and plot the corresponding confusion matrices to analyze the prediction results for each model.
# Perform a k-NN on the given dataset and plot the confusion matrix
# compute number of samples
n = len(trainy)
for k in [1, n]:
#### START YOUR SOLUTION HERE ####
# Fit a kNN classifier
knn = KNeighborsClassifier(n_neighbors=k).fit(trainX, trainy)
# Compute the predictions on the test data using the trained model
pred = knn.predict(testX)
# Compute accuracy
acc_score = accuracy_score(testy, pred)
# Compute the confusion matrix
conf = confusion_matrix(testy, pred)
# Plot the confusion matrix using a heatmap
sns.heatmap(conf,
annot=True,
fmt='d',
cbar=False,
cmap="coolwarm_r",
xticklabels=labels,
yticklabels=labels,
linewidth=1)
plt.title('Accuracy score for k={} : {}'.format(k, round(acc_score, 3)))
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
#### END YOUR SOLUTION HERE ####
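The two extremes behave very differently: k=1 memorizes the training set (perfect on training data, noisy on unseen data), while k=n always votes with the entire training set. A minimal sketch on synthetic data (hypothetical, not the diabetes set) illustrates the k=n case:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(30, 2)                          # synthetic features
y = np.array([0] * 20 + [1] * 10)            # class 0 is the clear majority

# With k = n, every query's neighbourhood is the whole training set,
# so (with uniform weights) every prediction is the majority class
knn = KNeighborsClassifier(n_neighbors=len(y)).fit(X, y)
pred = knn.predict(rng.rand(5, 2))
print(pred)                                  # all zeros: the majority class
```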
# Import standard libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pylab as plt
%matplotlib inline
# Import to load arff file from url
from scipy.io import arff
import urllib.request
import io
# Sklearn import
from sklearn.model_selection import train_test_split # Splitting the data set
from sklearn.model_selection import KFold, cross_val_score # Cross validation
from sklearn.preprocessing import MinMaxScaler # Normalization
from sklearn.preprocessing import PolynomialFeatures # Polynomial features
from sklearn.preprocessing import LabelEncoder #Label encoding
from sklearn.preprocessing import OneHotEncoder # 1-hot encoding
from sklearn.linear_model import LinearRegression # Regression linear model
from sklearn.linear_model import Lasso # Lasso model
from sklearn.linear_model import Ridge # Ridge model
from sklearn.linear_model import LassoCV # Lasso with cross validation
from sklearn.linear_model import RidgeCV # Ridge with cross validation
from sklearn.linear_model import ElasticNet # ElasticNet model
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score # Metrics for errors
Regression¶
Model¶
Suppose we have n observations of an outcome $\boldsymbol{y}$ and d associated features $\boldsymbol{x_1}$, $\boldsymbol{x_2}$, ... , $\boldsymbol{x_d}$ (note that $\boldsymbol{y}$, $\boldsymbol{x_1}$, ..., $\boldsymbol{x_d}$ are vectors):
| Outcome | Feature 1 | Feature 2 | ... | Feature d | |
|---|---|---|---|---|---|
| Observation 1 | $y_1$ | $x_{11}$ | $x_{12}$ | ... | $x_{1d}$ |
| Observation 2 | $y_2$ | $x_{21}$ | $x_{22}$ | ... | $x_{2d}$ |
| ... | ... | ... | ... | ... | ... |
| Observation n | $y_n$ | $x_{n1}$ | $x_{n2}$ | ... | $x_{nd}$ |
The goal of regression is to relate input feature variables to the outcome variable, either to predict outcomes for new observations, to understand the effect of the features on the outcome, or both. For either goal, we need to find a function that approximates the output “well enough” given some inputs.
For instance, in the case of multivariate linear regression, for each observation, we have the predicted value $\hat{y_i}$: $$\hat{y_i}:=w_0 + w_1 x_{i,1} + w_2 x_{i,2} + ... + w_d x_{i,d}$$ where $w_0$ is the intercept (bias term), and $w_1$, ... , $w_d$ are the slope coefficients (i.e., weights) of each feature.
More generally, let $f$ be our model function, $\boldsymbol{w}=(w_0, w_1, ..., w_d)$ the vector of weights, and $\boldsymbol{X}=[\boldsymbol{x_1}$, ... , $\boldsymbol{x_d}]$ the matrix of feature variables. For all observations, we have, with $\boldsymbol{X_{i*}}$ the $i^{th}$ row:
$$\hat{y_i} := f(\boldsymbol{X_{i*}}, \boldsymbol{w})$$
In our illustration, we have focused on a multivariate linear regression, but the formulation will be the same for more complex models, such as neural networks (also functions mapping inputs to outputs), which we will see later in this course.
Now our objective is to find the predicted values $\hat{y_i}$ that are the closest to the observations $y_i$ of the available dataset. In other words, we want to minimize the errors $\epsilon_i = y_i - \hat{y_i}$. There are several possible techniques. Below, we present the simplest one, namely the least squares problem.
Least squares problem¶
The idea is to minimize the sum of squared residuals (aka RSS - Residual Sum of Squares):
$$ \min_\boldsymbol{w} \sum_{i=1}^n (y_i - \hat{y_i})^2 = \min_\boldsymbol{w} \sum_{i=1}^n (y_i - f(\boldsymbol{X_{i*}}, \boldsymbol{w}))^2 $$
Graphically, for a simple linear regression, we minimize the area of the squares between our observations and our predicted values:
Source: Wikipedia - Coefficient of determination. Author: Orzetto
The coefficient of determination $R^2$ informs about the goodness of fit: $R^2= 1 -\frac{\color{blue}{RSS}}{\color{red}{TSS}} $
- Residual Sum of Squares: $\color{blue}{RSS=\sum_i (y_i - \hat{y_i})^2}$
- Total Sum of Squares: $\color{red}{TSS=\sum_i (y_i - \bar{y})^2}$
When $R^2=1$, then $RSS=0$, meaning all the errors are equal to zero, and the model gives "perfect" prediction.
When $R^2=0$, then $RSS=TSS$, hence our model is not more informative than taking the average of our observations.
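The definition above is easy to check by hand. The sketch below computes $R^2$ from RSS and TSS for toy numbers and compares it with sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.0, 5.0, 7.0, 9.0])           # toy observations
y_hat = np.array([2.5, 5.5, 6.5, 9.5])       # toy predictions

rss = np.sum((y - y_hat) ** 2)               # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)            # total sum of squares
r2 = 1 - rss / tss
print(r2, r2_score(y, y_hat))                # both give 0.95
```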
The prediction errors will generally decrease with the complexity of the model (e.g. polynomial regression). But what could go wrong?
- The prediction error decreases but... there is a risk of overfitting: the model does not generalize well on unseen data!
Regularization ¶
The objective of regularization is to address overfitting.
The general idea is to put an additional constraint - or penalty - on our parameters $\boldsymbol{w}$, instead of focusing only on optimizing the errors. Here is the new problem formulation:
$$ \min_\boldsymbol{w} L(\boldsymbol{y}, \boldsymbol{X}, \boldsymbol{w}) + \lambda R(\boldsymbol{w}) $$
- $L(\boldsymbol{y}, \boldsymbol{X}, \boldsymbol{w})$ is the loss/cost function. It measures the prediction error (on a given dataset).
- For instance, we can use the least square loss function:
$ L(\boldsymbol{y}, \boldsymbol{X}, \boldsymbol{w}) = \frac{1}{n} \sum_i^n (y_i - f(\boldsymbol{X_{i*}}, \boldsymbol{w}))^2 $
- $\lambda \geq 0$ is the penalty parameter, controlling the strength of the regularization
- $R(\boldsymbol{w})$ is the regularization function that constrains the model, typically penalizing the model parameters $w_1$, ..., $w_d$ (weights excluding the bias term).
What regularization function should we use? Below are some common examples...
LASSO regression, standing for "Least Absolute Shrinkage and Selection Operator", uses the $L_1$-norm (absolute value norm) of the parameters as regularization function: $$ R(\boldsymbol{w})= \sum_{j=1}^d |w_j| $$
- Pros
- Forces most entries of $\boldsymbol{w}$ to be 0. In other words, there is a feature selection effect, and the technique is preferred when $\boldsymbol{w}$ is expected to be sparse
- Cons
- Arbitrary selection among highly correlated variables
- Selects at most $n$ features when there are more features than observations ($d > n$)
- Features with small $w_j$ values will be forced to zero
Ridge regression uses the square of the $L_2$-norm (Euclidean norm) of the parameters as regularization function: $$R(\boldsymbol{w})= \sum_{j=1}^d w_j^2 $$
- Pros
- More stable solution (shrink parameters estimate). This method is thus preferred when $\boldsymbol{w}$ is expected to take small values
- Cons
- Less sensitive to data
- $\boldsymbol{w}$ is typically still not sparse (no explicit feature selection)
Elastic net regression uses a linear combination of the Ridge and LASSO penalties: $$ R(\boldsymbol{w})= \lambda_1 \sum_{j=1}^d |w_j| + \lambda_2 \sum_{j=1}^d w_j^2 $$
- Pros
- Ridge term makes the problem convex (unique solution)
- Overcomes some of the limitations of LASSO: it can select groups of highly correlated variables, and more than $n$ features when there are more features than observations ($d > n$)
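The contrast between the two penalties is easy to see empirically. A minimal sketch on synthetic data (the alpha values are illustrative, not tuned): Lasso sets the coefficients of the irrelevant features exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 10)                                 # 10 features...
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)   # ...only the first 2 matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # most of the irrelevant ones
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none
```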
Why does LASSO lead to feature selection?¶
In the previous section, we stated that LASSO regression forces weights to zero, hence performing feature selection, while Ridge only shrinks the parameters. Why is that? The answer lies in the shape of the regularization functions. Both are convex, but the ridge function is smooth and strictly convex, while the LASSO function is not differentiable at zero and its contours have corners on the coordinate axes.
What does it imply? Let's see graphically, in a model with two features: $$ \hat{y_i}= w_1 x_{i,1} + w_2 x_{i,2} $$
The least square loss function is quadratic: $$L(w_1, w_2) = \sum_i^n (y_i - w_1 x_{i,1} - w_2 x_{i,2})^2$$
Hence, plotted in a plane, our "indifference curves" (i.e., the curves along which the loss function is equal to a given value) look like elliptical contours - see figure below, in red. Without regularization, our optimum would be located at the center of the ellipse.
What happens when we add a regularization term? We transform our minimization problem. Mathematically, adding the regularization term is equivalent to adding a constraint on the weights:
- LASSO: $|w_1| + |w_2| \leq t$
- Ridge: $w_1^2 + w_2^2 \leq t$
Graphically, the LASSO constraint looks like a diamond (cyan) while the Ridge constraint is a disk (green).
When we relax the constraint (increase $t$), the constrained regions (diamond and disk) get bigger and can eventually contain the center of the ellipse. In that case, the optimum weights are the ones obtained without regularization.
Otherwise, the optimum weights are obtained at the intersection of the elliptical contours and the boundary of the constrained region. With LASSO, this intersection typically happens at one of the corners of the diamond, i.e., where one of the weights is equal to zero. With Ridge, the intersection happens at some point of the circle: while the values of the weights are shrunk, they will (almost) never be exactly zero.
Source: Ridge and Lasso Regression: L1 and L2 Regularization, Saptashwa Bhattacharyya, Towards Data Science
Solving our model: learning parameters via gradient descent¶
To find the solution of our problem, we use numerical optimization: we search the minimum by iteration. Recall the optimization problem we want to solve: minimize the prediction errors (loss function), with a constraint on our parameters (regularization function).
$$ \min_\boldsymbol{w} L(\boldsymbol{y}, \boldsymbol{X}, \boldsymbol{w}) + \lambda R(\boldsymbol{w}) $$
We call $J$ our objective function (also called cost function): $J:= L + \lambda R$
One possible numerical method to solve this problem is gradient descent, an optimization algorithm with an iterative update rule:
- We first start with an initial value $\boldsymbol{w^0}=(w^0_0, w^0_1,...,w^0_d)$, selected at random or as a best guess
- We update our parameters: $\boldsymbol{w^{k+1}}=\boldsymbol{w^{k}} - \gamma \nabla J(\boldsymbol{w^{k}})$, i.e., we step in the direction opposite to the gradient (hence the minus sign)
- $\gamma$ is the learning rate
- $\nabla J(\boldsymbol{w^{k}})$ is the gradient, i.e., the vector of derivatives of $J$ with respect to $w_0$, $w_1$, ..., $w_d$, evaluated at $\boldsymbol{w^{k}}$
- We continue until a given convergence criterion is met (fixed point)

There are many flavours of this method; they mostly differ in how the update rule is tweaked.
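The update rule can be implemented in a few lines for the least squares loss. A minimal sketch on a hypothetical toy problem (the learning rate and iteration count below are chosen by hand, not tuned):

```python
import numpy as np

rng = np.random.RandomState(1)
X = np.c_[np.ones(50), rng.rand(50)]         # first column of ones plays the role of the bias w0
w_true = np.array([2.0, -3.0])
y = X @ w_true + 0.01 * rng.randn(50)        # outcome with a little noise

gamma = 0.1                                  # learning rate
w = np.zeros(2)                              # initial guess w^0
for _ in range(5000):
    grad = -2 / len(y) * X.T @ (y - X @ w)   # gradient of the mean squared loss
    w = w - gamma * grad                     # step AGAINST the gradient
print(w)                                     # close to [2, -3]
```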
Implementation¶
We will use the Python library sklearn (for more details see the documentation) to implement various regression techniques, including the ones discussed above: simple (univariate) linear regression, multivariate linear regression, polynomial linear regression, lasso, ridge, as well as scaling, encoding, and cross-validation.
Remember, you train your model (learn parameters) using your training set, and then evaluate its generalization performance on the (unseen) test set.
Hence, after loading and cleaning our dataset, we will follow these steps:
- Preprocessing: split our dataset between training set (80% of observations) and test set (20% of observations), scaling, encoding
- Create and fit our model, i.e., learn the parameters using the training set
- Predict new observations and evaluate our model using the test set
Load and discover the dataset¶
In this section, we will use the weather dataset, which contains weather data (e.g., temperature, wind speed, humidity, rain) recorded in Canberra between November 2007 and October 2008. Let's load and explore our dataset.
#Load the dataset
weather = pd.read_csv('weather.csv').drop_duplicates().dropna() # drop duplicates and NaN values
# Display a sample of the data
display(weather.head())
#Print the data types
print(weather.dtypes)
| Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RISK_MM | RainTomorrow | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2007-11-01 | Canberra | 8.0 | 24.3 | 0.0 | 3.4 | 6.3 | NW | 30.0 | SW | ... | 29 | 1019.7 | 1015.0 | 7 | 7 | 14.4 | 23.6 | No | 3.6 | Yes |
| 1 | 2007-11-02 | Canberra | 14.0 | 26.9 | 3.6 | 4.4 | 9.7 | ENE | 39.0 | E | ... | 36 | 1012.4 | 1008.4 | 5 | 3 | 17.5 | 25.7 | Yes | 3.6 | Yes |
| 2 | 2007-11-03 | Canberra | 13.7 | 23.4 | 3.6 | 5.8 | 3.3 | NW | 85.0 | N | ... | 69 | 1009.5 | 1007.2 | 8 | 7 | 15.4 | 20.2 | Yes | 39.8 | Yes |
| 3 | 2007-11-04 | Canberra | 13.3 | 15.5 | 39.8 | 7.2 | 9.1 | NW | 54.0 | WNW | ... | 56 | 1005.5 | 1007.0 | 2 | 7 | 13.5 | 14.1 | Yes | 2.8 | Yes |
| 4 | 2007-11-05 | Canberra | 7.6 | 16.1 | 2.8 | 5.6 | 10.6 | SSE | 50.0 | SSE | ... | 49 | 1018.3 | 1018.5 | 7 | 7 | 11.1 | 15.4 | Yes | 0.0 | No |
5 rows × 24 columns
Date              object
Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm       int64
Humidity9am        int64
Humidity3pm        int64
Pressure9am      float64
Pressure3pm      float64
Cloud9am           int64
Cloud3pm           int64
Temp9am          float64
Temp3pm          float64
RainToday         object
RISK_MM          float64
RainTomorrow      object
dtype: object
Note that the dataset contains numerical variables (e.g., temperature, rainfall, humidity, pressure) and categorical variables (e.g., wind direction). In addition, we have weather data at 9am and 3pm. For simplicity, we will only work with the 3pm values. Let's get some summary statistics:
# Select features of interest
weather3pm = weather.loc[:,['Temp3pm','Humidity3pm', 'Cloud3pm', 'Pressure3pm', 'WindSpeed3pm', 'WindDir3pm', 'Sunshine', 'Rainfall']]
# Summary statistics
display(weather3pm.describe())
# Correlation matrix
display(weather3pm.corr(numeric_only = True))
| Temp3pm | Humidity3pm | Cloud3pm | Pressure3pm | WindSpeed3pm | Sunshine | Rainfall | |
|---|---|---|---|---|---|---|---|
| count | 328.000000 | 328.000000 | 328.000000 | 328.000000 | 328.000000 | 328.000000 | 328.000000 |
| mean | 19.556402 | 44.003049 | 4.000000 | 1016.530793 | 18.185976 | 8.014939 | 1.440854 |
| std | 6.644311 | 16.605975 | 2.652101 | 6.469774 | 8.926759 | 3.506646 | 4.289427 |
| min | 5.100000 | 13.000000 | 0.000000 | 996.800000 | 4.000000 | 0.000000 | 0.000000 |
| 25% | 14.500000 | 32.000000 | 1.000000 | 1012.400000 | 11.000000 | 6.000000 | 0.000000 |
| 50% | 18.850000 | 42.500000 | 4.000000 | 1016.900000 | 17.000000 | 8.750000 | 0.000000 |
| 75% | 24.225000 | 54.000000 | 7.000000 | 1021.125000 | 24.000000 | 10.700000 | 0.200000 |
| max | 34.500000 | 93.000000 | 8.000000 | 1033.200000 | 52.000000 | 13.600000 | 39.800000 |
| Temp3pm | Humidity3pm | Cloud3pm | Pressure3pm | WindSpeed3pm | Sunshine | Rainfall | |
|---|---|---|---|---|---|---|---|
| Temp3pm | 1.000000 | -0.569348 | -0.181667 | -0.332099 | -0.239119 | 0.463721 | -0.089740 |
| Humidity3pm | -0.569348 | 1.000000 | 0.530715 | -0.047607 | 0.015860 | -0.760267 | 0.287244 |
| Cloud3pm | -0.181667 | 0.530715 | 1.000000 | -0.146235 | 0.011625 | -0.657198 | 0.134894 |
| Pressure3pm | -0.332099 | -0.047607 | -0.146235 | 1.000000 | -0.318008 | -0.024120 | -0.263710 |
| WindSpeed3pm | -0.239119 | 0.015860 | 0.011625 | -0.318008 | 1.000000 | 0.046140 | 0.058151 |
| Sunshine | 0.463721 | -0.760267 | -0.657198 | -0.024120 | 0.046140 | 1.000000 | -0.158062 |
| Rainfall | -0.089740 | 0.287244 | 0.134894 | -0.263710 | 0.058151 | -0.158062 | 1.000000 |
Linear regression¶
We will first implement a simple (univariate) linear regression. Our goal in this section will be to try to predict the temperature given the level of humidity:
$$\hat{Temperature_i} = w_0 + w_1 Humidity_i$$
X = weather3pm[['Humidity3pm']]
y = weather3pm[['Temp3pm']]
Splitting the dataset ¶
We again use the train_test_split (Documentation) to split the available data into training and test set.
The training set will be used to retrieve the best values of the weights $w_0$ and $w_1$ according to a combination of input (humidity) and output (temperature) observations. The test set will be used to evaluate our model. Since our model will be trained on particular values we want to test its generalization ability on a new set of data (the test set).
The test size here is 20% of the original data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=True)
Note that we control how the data shuffling is applied by providing a random state, in order to obtain reproducible output across multiple function calls.
Create and Fit model ¶
To predict the output variable we will use a simple linear regression, the module is called LinearRegression (Documentation). Here is the import line:
from sklearn.linear_model import LinearRegression
We follow three steps:
- Create a new LinearRegression model from sklearn
- Fit the linear model on the X_train (features) and y_train (target) data using the fit() function
- Check the model accuracy using the score() function, which returns the coefficient of determination $R^2$ of the prediction. The best possible score is 1 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an $R^2$ score of 0.
# There are three steps to model something with sklearn
# 1. Set up the model
model = LinearRegression(fit_intercept= True)
# 2. Use fit
model.fit(X_train, y_train)
# 3. Check the score/accuracy
print("R\u00b2 Score of the model: ", round(model.score(X_train, y_train), 3))
R² Score of the model: 0.313
After fitting the model, we can easily retrieve the values of the different weight coefficients (the intercept, and the weight of each feature):
print("Intercept: ", model.intercept_[0])
print("Feature coefficient (weight): ", model.coef_.flatten()[0])
Intercept:  29.543003059769497 Feature coefficient (weight):  -0.22661055784458156
The intercept corresponds to the value of $w_0$. There is only one coefficient, $w_1$, linked to the humidity feature. Since the intercept and the coefficients are represented as arrays and we have only one value of each, we apply flatten() and [0].
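These two numbers fully determine the model's predictions. A small sketch on toy data (hypothetical, not our weather set) reproduces predict() by hand from intercept_ and coef_:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])           # exactly y = 1 + 2x

m = LinearRegression().fit(X, y)
x_new = 5.0
manual = m.intercept_ + m.coef_[0] * x_new   # w0 + w1 * x
print(manual, m.predict([[x_new]])[0])       # both 11.0
```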
Prediction and Evaluation ¶
Once the model is trained, we can use the predict() function to predict the values of the test set from X_test. These predictions can be compared to the true values, i.e., y_test. Let's try with one value of the test set. Note that our model takes a matrix as input (X matrix), so even for a prediction on a single scalar value we should use [[...]].
humidity_test = X_test.iloc[0].values[0]
temperature_predicted = model.predict([[humidity_test]]).flatten()[0]
temperature_test = y_test.iloc[0].values[0]
print(f"Prediction/observed temperature for humidity {humidity_test}: {temperature_predicted:.1f}°C vs {temperature_test}°C")
Prediction/observed temperature for humidity 28: 23.2°C vs 27.0°C
To better understand how the predicted and actual values differ, we can plot the predictions (line) and the true values from the test set (dots). It is more informative to predict on the test set because, unlike with the training set, our model was not trained on these values.
# Model prediction from X_test
predictions = model.predict(X_test)
# Plot the prediction (the line) over the true value (the dots)
plt.scatter(X_test, y_test)
plt.plot(X_test, predictions, 'r')
plt.title("Humidity level against temperature")
plt.xlabel('Humidity level')
plt.ylabel('Temperature °C')
plt.show()
We can compare the error of our model by using some metrics like the MAE (mean absolute error), MSE (mean squared error) or coefficient of determination $R^2$ score. Sklearn offers some nice modules to compute these measures (MAE, MSE, $R^2$). Here is the import line:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
These metrics take the y_test values and the predictions as arguments and quantify how far the predictions are from the true values. They are very helpful when comparing the performance of models.
# Compute the MAE, the MSE and the R^2
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MAE: {mae:0.2f}")
print(f"MSE: {mse:0.2f}")
print(f"R\u00b2: {r2:0.2f} " )
MAE: 4.27 MSE: 24.44 R²: 0.37
It is also interesting to compare the results of these metrics between the data from the test set and those from the train set to see whether our model generalizes well:
predictions_train = model.predict(X_train)
mae_train = mean_absolute_error(y_train, predictions_train)
mse_train = mean_squared_error(y_train, predictions_train)
r2_train = r2_score(y_train, predictions_train)
print(f"MAE test set: {mae:0.2f}; MAE train set: {mae_train:0.2f};")
print(f"MSE test set: {mse:0.2f}; MSE train set: {mse_train:0.2f};")
print(f"R\u00b2 test set: {r2:0.2f}; R\u00b2 train set: {r2_train:0.2f};" )
MAE test set: 4.27; MAE train set: 4.77; MSE test set: 24.44; MSE train set: 31.08; R² test set: 0.37; R² train set: 0.31;
Remember, the higher the $R^2$ value, the better the fit. In this case, the test data yields a higher coefficient as well as lower mean absolute and mean squared errors. While this might seem a bit counterintuitive, one possible explanation lies in the particular observations selected when we split our dataset into training and test sets, combined with the small size of the dataset. One remedy would be to rely on cross-validation.
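Cross-validation averages the evaluation over several train/test splits, which reduces the dependence on any one particular split. A minimal sketch using the KFold and cross_val_score modules (synthetic data; 5 folds are a common default choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(100, 1)                         # synthetic feature
y = 2 * X[:, 0] + 0.1 * rng.randn(100)       # linear signal with noise

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')
print(scores.round(2), "mean:", scores.mean().round(2))
```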
Multivariate linear regression¶
We will now apply the same method to several features, namely, humidity, pressure, wind speed, wind direction, sunshine, rainfall, and cloud data to predict the temperature, still at 3pm.
X = weather[['Humidity3pm', 'Cloud3pm', 'Pressure3pm', 'WindSpeed3pm', 'WindDir3pm', 'Sunshine', 'Rainfall']]
y = weather[['Temp3pm']]
Splitting dataset ¶
We apply the same procedure as before:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=True)
Preprocessing: encoding categorical variables¶
The feature 'WindDir3pm' is a categorical variable. To use it in our model, we need to encode it. Here, we will use a label encoding, using the sklearn module LabelEncoder. As an alternative, we could use 1-hot encoding, with the sklearn module OneHotEncoder. Here are the import lines:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
Note: You should encode your data after splitting the dataset to avoid data leakage (train-test contamination): fit the encoder on the training set, then transform both the training and the test sets using the encoding map learned from the training data.
print(X_train[['WindDir3pm']])
# Extract the column of interest
wind_dir_3pm = X_train[['WindDir3pm']].values.ravel()
wind_dir_3pm_test = X_test[['WindDir3pm']].values.ravel()
#Define the encoder
le = LabelEncoder()
#Fit the encoder
le.fit(wind_dir_3pm)
#Transform the train and the test set
X_train = X_train.assign(WindDir3pm=le.transform(wind_dir_3pm))
X_test = X_test.assign(WindDir3pm=le.transform(wind_dir_3pm_test))
print(X_train[['WindDir3pm']])
WindDir3pm
1 W
333 NW
8 ENE
232 WNW
101 W
.. ...
361 NW
204 SSW
119 SE
47 E
179 WNW
[262 rows x 1 columns]
WindDir3pm
1 13
333 7
8 1
232 14
101 13
.. ...
361 7
204 11
119 9
47 0
179 14
[262 rows x 1 columns]
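As a sketch of the 1-hot alternative mentioned above: OneHotEncoder creates one binary column per category, and handle_unknown='ignore' avoids errors when the test set contains a direction never seen during training (toy direction values, not our full dataset):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_dirs = np.array([['NW'], ['SE'], ['NW'], ['E']])   # toy training column
test_dirs = np.array([['SE'], ['SSW']])                  # 'SSW' unseen in training

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train_dirs)                          # fit on the training set only
encoded = ohe.transform(test_dirs).toarray()
print(ohe.categories_[0])                    # ['E' 'NW' 'SE']
print(encoded)                               # unseen 'SSW' encodes as all zeros
```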
Rescaling¶
Next, we rescale our data.
Note: Generally you should normalize the data right after splitting the dataset. The normalization is important here to reduce the variance of our model and get better results.
We can use the sklearn MinMaxScaler module to normalize the data. This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one. Here is the import line:
from sklearn.preprocessing import MinMaxScaler
#Define the scaler
scaler = MinMaxScaler()
#Fit the scaler
scaler.fit(X_train)
#Transform the training and the test set
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
## Note that these two steps can be merged into one (only for the training set)
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)
Create and Fit model ¶
We follow the same steps as before:
# 1. Set up the model
model = LinearRegression()
# 2. Use fit
model.fit(X_train, y_train)
# 3. Check the score/accuracy
print("R\u00b2 Score of the model: ", round(model.score(X_train, y_train), 3))
# 4. Print the coefficients of the linear model
print("Intercept: ", model.intercept_[0])
model_coeff = pd.DataFrame(model.coef_.flatten(),
index=['Humidity3pm', 'Cloud3pm', 'Pressure3pm', 'WindSpeed3pm', 'WindDir3pm', 'Sunshine', 'Rainfall'],
columns=['Coefficients multivariate model'])
model_coeff # Get the coefficients, w
R² Score of the model: 0.621 Intercept: 37.93596617948596
| Coefficients multivariate model | |
|---|---|
| Humidity3pm | -16.236408 |
| Cloud3pm | 2.993911 |
| Pressure3pm | -19.235234 |
| WindSpeed3pm | -13.840040 |
| WindDir3pm | -2.875112 |
| Sunshine | 4.779577 |
| Rainfall | -2.598842 |
The coefficient values inform us about the relative importance of each feature for our prediction.
Prediction and evaluation ¶
Finally, we evaluate our model performance following the same procedure as before:
# Predict:
predictions = model.predict(X_test)
# Compute the MAE, the MSE and the R^2 on the test set
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
# Compute the MAE, the MSE and the R^2 on the training set
predictions_train = model.predict(X_train)
mae_train = mean_absolute_error(y_train, predictions_train)
mse_train = mean_squared_error(y_train, predictions_train)
r2_train = r2_score(y_train, predictions_train)
print(f"MAE test set: {mae:0.2f}; MAE training set: {mae_train:0.2f};")
print(f"MSE test set: {mse:0.2f}; MSE training set: {mse_train:0.2f};")
print(f"R\u00b2 test set: {r2:0.2f}; R\u00b2 training set: {r2_train:0.2f};" )
MAE test set: 3.22; MAE training set: 3.39; MSE test set: 15.86; MSE training set: 17.13; R² test set: 0.59; R² training set: 0.62;
The mean absolute error and mean squared error in our multivariate analysis are lower than in the univariate case: as expected, adding more complexity (features) seems to have improved our prediction.
Note that you should not use the $R^2$ to compare several models since the indicator is sensitive to the number of features. Instead, you can for instance use the Adjusted $R^2$.
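The Adjusted $R^2$ is $1 - (1 - R^2)\frac{n-1}{n-d-1}$ for $n$ observations and $d$ features, which penalizes features that do not improve the fit. A quick sketch with illustrative numbers (an assumed $R^2$ of 0.62 and our training set size of 262):

```python
def adjusted_r2(r2, n, d):
    """Adjusted R^2 for n observations and d features."""
    return 1 - (1 - r2) * (n - 1) / (n - d - 1)

# Same R^2, different feature counts: the penalty grows with d
print(adjusted_r2(0.62, 262, 7))
print(adjusted_r2(0.62, 262, 100))
```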
Polynomial linear regression¶
Polynomial regression is very powerful, as a polynomial of sufficiently high degree is flexible enough to approximate a wide range of functions. We are using the module PolynomialFeatures to preprocess our data (Documentation):
from sklearn.preprocessing import PolynomialFeatures
The function PolynomialFeatures generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.
# We will use a degree 2
poly = PolynomialFeatures(2)
# Transform our training and test set
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# Feature name:
X_poly_features = poly.get_feature_names_out(['Humidity3pm', 'Cloud3pm', 'Pressure3pm', 'WindSpeed3pm', 'WindDir3pm', 'Sunshine', 'Rainfall'])
print(X_poly_features)
['1' 'Humidity3pm' 'Cloud3pm' 'Pressure3pm' 'WindSpeed3pm' 'WindDir3pm' 'Sunshine' 'Rainfall' 'Humidity3pm^2' 'Humidity3pm Cloud3pm' 'Humidity3pm Pressure3pm' 'Humidity3pm WindSpeed3pm' 'Humidity3pm WindDir3pm' 'Humidity3pm Sunshine' 'Humidity3pm Rainfall' 'Cloud3pm^2' 'Cloud3pm Pressure3pm' 'Cloud3pm WindSpeed3pm' 'Cloud3pm WindDir3pm' 'Cloud3pm Sunshine' 'Cloud3pm Rainfall' 'Pressure3pm^2' 'Pressure3pm WindSpeed3pm' 'Pressure3pm WindDir3pm' 'Pressure3pm Sunshine' 'Pressure3pm Rainfall' 'WindSpeed3pm^2' 'WindSpeed3pm WindDir3pm' 'WindSpeed3pm Sunshine' 'WindSpeed3pm Rainfall' 'WindDir3pm^2' 'WindDir3pm Sunshine' 'WindDir3pm Rainfall' 'Sunshine^2' 'Sunshine Rainfall' 'Rainfall^2']
Now we proceed as before, performing a linear regression:
# Set up the model
model_poly = LinearRegression(fit_intercept=False) # no separate intercept needed since PolynomialFeatures already adds a column of ones to the data
# Fit
model_poly.fit(X_train_poly, y_train)
# Check the score/accuracy
print("R\u00b2 Score of the model: ", round(model_poly.score(X_train_poly, y_train), 3))
# Print the coefficients of the linear model
model_coeff = pd.DataFrame(model_poly.coef_.flatten(),
index=X_poly_features,
columns=['Coefficients polynomial model'])
model_coeff # Get the coefficients, w
R² Score of the model: 0.718
| Coefficients polynomial model | |
|---|---|
| 1 | 70.818534 |
| Humidity3pm | -34.497306 |
| Cloud3pm | -14.110065 |
| Pressure3pm | -58.905587 |
| WindSpeed3pm | -44.953044 |
| WindDir3pm | -16.625414 |
| Sunshine | -18.112180 |
| Rainfall | -2.382792 |
| Humidity3pm^2 | 5.297209 |
| Humidity3pm Cloud3pm | 3.811526 |
| Humidity3pm Pressure3pm | 28.136987 |
| Humidity3pm WindSpeed3pm | 12.537486 |
| Humidity3pm WindDir3pm | -8.755339 |
| Humidity3pm Sunshine | -7.294708 |
| Humidity3pm Rainfall | 8.673520 |
| Cloud3pm^2 | -1.263658 |
| Cloud3pm Pressure3pm | 6.184263 |
| Cloud3pm WindSpeed3pm | 11.543304 |
| Cloud3pm WindDir3pm | 1.151535 |
| Cloud3pm Sunshine | 14.325605 |
| Cloud3pm Rainfall | 11.561408 |
| Pressure3pm^2 | 10.339247 |
| Pressure3pm WindSpeed3pm | 16.659560 |
| Pressure3pm WindDir3pm | 18.404142 |
| Pressure3pm Sunshine | -1.986992 |
| Pressure3pm Rainfall | -0.245664 |
| WindSpeed3pm^2 | 5.415717 |
| WindSpeed3pm WindDir3pm | 0.445015 |
| WindSpeed3pm Sunshine | 15.305645 |
| WindSpeed3pm Rainfall | -19.222073 |
| WindDir3pm^2 | 4.630299 |
| WindDir3pm Sunshine | 0.635540 |
| WindDir3pm Rainfall | 20.616689 |
| Sunshine^2 | 13.589054 |
| Sunshine Rainfall | -5.688230 |
| Rainfall^2 | -22.396915 |
Finally, we evaluate the performance of our model:
# Predict:
predictions = model_poly.predict(X_test_poly)
# Compute the MAE, the MSE and the R^2 on the test set
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
# Compute the MAE, the MSE and the R^2 on the training set
predictions_train = model_poly.predict(X_train_poly)
mae_train = mean_absolute_error(y_train, predictions_train)
mse_train = mean_squared_error(y_train, predictions_train)
r2_train = r2_score(y_train, predictions_train)
print(f"MAE test set: {mae:0.2f}; MAE training set: {mae_train:0.2f};")
print(f"MSE test set: {mse:0.2f}; MSE training set: {mse_train:0.2f};")
print(f"R\u00b2 test set: {r2:0.2f}; R\u00b2 training set: {r2_train:0.2f};" )
MAE test set: 2.94; MAE training set: 2.90; MSE test set: 13.84; MSE training set: 12.75; R² test set: 0.64; R² training set: 0.72;
The mean absolute and mean squared errors on the test set decreased compared to the previous linear model.
However, beware: complex models tend to overfit.
For instance, if we were to use polynomial features of degree 3, the mean absolute and mean squared errors on the training set would decrease further, but the errors on the test set would increase dramatically, and the $R^2$ on the test set would even turn negative. Try it!
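This overfitting effect can be reproduced in isolation. The sketch below uses a small synthetic dataset (not the weather data; sample size, noise level, and degrees are chosen to make the effect visible): the training $R^2$ keeps rising with the polynomial degree, while the test $R^2$ collapses once the model has almost as many features as training samples.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Small synthetic regression problem: quadratic signal plus noise
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(50, 4))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(scale=3.0, size=50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}  # degree -> (train R2, test R2)
for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree)
    Xp_tr = poly.fit_transform(X_tr)
    Xp_te = poly.transform(X_te)
    model = LinearRegression(fit_intercept=False).fit(Xp_tr, y_tr)
    scores[degree] = (r2_score(y_tr, model.predict(Xp_tr)),
                      r2_score(y_te, model.predict(Xp_te)))
    print(f"degree {degree}: train R2 = {scores[degree][0]:.3f}, "
          f"test R2 = {scores[degree][1]:.3f}")
```

With degree 3 there are 35 polynomial features for only 40 training samples, so the model nearly interpolates the training data and generalizes poorly.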
To avoid such issues, we can apply regularization techniques.
Regularization ¶
We will now implement some regularization techniques discussed above, in combination with our previous polynomial linear regression.
Lasso¶
We can use the sklearn Lasso module to implement a Lasso regularization (Documentation). Here is the import line:
from sklearn.linear_model import Lasso
The procedure is the same as before. Since we already split our dataset and preprocessed our training and test sets, we can create the model, fit it, and then evaluate its performance.
When creating the model, we can specify the penalty term alpha as an argument of Lasso():
# Set up the model
lasso_model = Lasso(alpha=0.2, fit_intercept=False)
# Use fit
lasso_model.fit(X_train_poly, y_train)
# Check the score/accuracy
print("R\u00b2 Score of the model: ", round(lasso_model.score(X_train_poly, y_train), 3))
# Print the coefficients of the linear model
model_coeff = pd.DataFrame(lasso_model.coef_.flatten(),
index=X_poly_features,
columns=['Coefficients Lasso model'])
model_coeff
R² Score of the model: 0.424
| Coefficients Lasso model | |
|---|---|
| 1 | 16.480539 |
| Humidity3pm | -3.131065 |
| Cloud3pm | 2.723063 |
| Pressure3pm | -0.000000 |
| WindSpeed3pm | -2.157586 |
| WindDir3pm | 0.000000 |
| Sunshine | 0.000000 |
| Rainfall | -0.000000 |
| Humidity3pm^2 | -0.000000 |
| Humidity3pm Cloud3pm | -0.000000 |
| Humidity3pm Pressure3pm | -0.000000 |
| Humidity3pm WindSpeed3pm | -0.000000 |
| Humidity3pm WindDir3pm | -0.000000 |
| Humidity3pm Sunshine | -0.000000 |
| Humidity3pm Rainfall | -0.000000 |
| Cloud3pm^2 | 0.000000 |
| Cloud3pm Pressure3pm | -0.000000 |
| Cloud3pm WindSpeed3pm | -0.000000 |
| Cloud3pm WindDir3pm | 0.000000 |
| Cloud3pm Sunshine | 0.000000 |
| Cloud3pm Rainfall | -0.000000 |
| Pressure3pm^2 | -3.868799 |
| Pressure3pm WindSpeed3pm | -0.000000 |
| Pressure3pm WindDir3pm | -0.000000 |
| Pressure3pm Sunshine | -0.000000 |
| Pressure3pm Rainfall | -0.000000 |
| WindSpeed3pm^2 | -0.000000 |
| WindSpeed3pm WindDir3pm | -0.000000 |
| WindSpeed3pm Sunshine | -0.000000 |
| WindSpeed3pm Rainfall | -0.000000 |
| WindDir3pm^2 | -0.000000 |
| WindDir3pm Sunshine | 0.000000 |
| WindDir3pm Rainfall | -0.000000 |
| Sunshine^2 | 11.676768 |
| Sunshine Rainfall | -0.000000 |
| Rainfall^2 | -0.000000 |
Notice the weights? Most of them were forced to exactly zero, meaning our model will not use the corresponding features for its predictions. The intuition is that these features do not provide enough predictive power to be worth keeping alongside the others.
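A self-contained sketch of this sparsity effect (synthetic data, where only the first two of ten features carry signal): as the penalty alpha grows, Lasso drives more and more coefficients exactly to zero, while the truly informative coefficients survive.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 matter; the other 8 are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

zero_counts = {}
for alpha in (0.001, 0.1, 1.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    zero_counts[alpha] = int(np.sum(lasso.coef_ == 0.0))
    print(f"alpha={alpha}: {zero_counts[alpha]} of 10 coefficients forced to exactly zero")
```

At a tiny alpha, Lasso behaves almost like plain least squares; at a large alpha, all eight noise features are dropped and only the two informative ones remain.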
Let's keep going with the evaluation of the model:
# Predict:
predictions = lasso_model.predict(X_test_poly)
# Compute the MAE, the MSE and the R^2 on the test set
mae_lasso = mean_absolute_error(y_test, predictions)
mse_lasso = mean_squared_error(y_test, predictions)
r2_lasso = r2_score(y_test, predictions)
# Compute the MAE, the MSE and the R^2 on the training set
predictions_train = lasso_model.predict(X_train_poly)
mae_train_lasso = mean_absolute_error(y_train, predictions_train)
mse_train_lasso = mean_squared_error(y_train, predictions_train)
r2_train_lasso = r2_score(y_train, predictions_train)
print(f"MAE test set: {mae_lasso:0.2f}; MAE training set: {mae_train_lasso:0.2f};")
print(f"MSE test set: {mse_lasso:0.2f}; MSE training set: {mse_train_lasso:0.2f};")
print(f"R\u00b2 test set: {r2_lasso:0.2f}; R\u00b2 training set: {r2_train_lasso:0.2f};" )
MAE test set: 4.07; MAE training set: 4.20; MSE test set: 24.21; MSE training set: 26.06; R² test set: 0.38; R² training set: 0.42;
As a result of the Lasso regularization, the MAE and MSE on the test set increased. Let's continue our exploration with Ridge regularization:
Ridge¶
We can use the sklearn Ridge module to implement a Ridge regularization (Documentation). Here is the import line:
from sklearn.linear_model import Ridge
We proceed as before:
# Set up the model
ridge_model = Ridge(alpha=1.0, fit_intercept=False)
# Use fit
ridge_model.fit(X_train_poly, y_train)
# Check the score/accuracy
print("R\u00b2 Score of the model: ", round(ridge_model.score(X_train_poly, y_train), 3))
# Print the coefficients of the linear model
model_coeff = pd.DataFrame(ridge_model.coef_.flatten(),
index=X_poly_features,
columns=['Coefficients Ridge model'])
model_coeff['Coefficients Lasso model']=lasso_model.coef_.flatten()
model_coeff['Coefficients polynomial model']=model_poly.coef_.flatten()
model_coeff
R² Score of the model: 0.66
| Coefficients Ridge model | Coefficients Lasso model | Coefficients polynomial model | |
|---|---|---|---|
| 1 | 21.167430 | 16.480539 | 70.818534 |
| Humidity3pm | -3.083184 | -3.131065 | -34.497306 |
| Cloud3pm | 4.714199 | 2.723063 | -14.110065 |
| Pressure3pm | -1.346583 | -0.000000 | -58.905587 |
| WindSpeed3pm | -3.411386 | -2.157586 | -44.953044 |
| WindDir3pm | 2.601451 | 0.000000 | -16.625414 |
| Sunshine | 8.691005 | 0.000000 | -18.112180 |
| Rainfall | 0.266296 | -0.000000 | -2.382792 |
| Humidity3pm^2 | 0.421859 | -0.000000 | 5.297209 |
| Humidity3pm Cloud3pm | 0.046893 | -0.000000 | 3.811526 |
| Humidity3pm Pressure3pm | -1.658249 | -0.000000 | 28.136987 |
| Humidity3pm WindSpeed3pm | -4.754987 | -0.000000 | 12.537486 |
| Humidity3pm WindDir3pm | -7.051185 | -0.000000 | -8.755339 |
| Humidity3pm Sunshine | -8.350637 | -0.000000 | -7.294708 |
| Humidity3pm Rainfall | 1.697219 | -0.000000 | 8.673520 |
| Cloud3pm^2 | -0.601082 | 0.000000 | -1.263658 |
| Cloud3pm Pressure3pm | -0.183609 | -0.000000 | 6.184263 |
| Cloud3pm WindSpeed3pm | -0.052566 | -0.000000 | 11.543304 |
| Cloud3pm WindDir3pm | -0.775338 | 0.000000 | 1.151535 |
| Cloud3pm Sunshine | 0.226056 | 0.000000 | 14.325605 |
| Cloud3pm Rainfall | 1.919206 | -0.000000 | 11.561408 |
| Pressure3pm^2 | -5.906742 | -3.868799 | 10.339247 |
| Pressure3pm WindSpeed3pm | -4.299129 | -0.000000 | 16.659560 |
| Pressure3pm WindDir3pm | -2.454472 | -0.000000 | 18.404142 |
| Pressure3pm Sunshine | -7.136742 | -0.000000 | -1.986992 |
| Pressure3pm Rainfall | 0.211044 | -0.000000 | -0.245664 |
| WindSpeed3pm^2 | -1.303169 | -0.000000 | 5.415717 |
| WindSpeed3pm WindDir3pm | -4.157581 | -0.000000 | 0.445015 |
| WindSpeed3pm Sunshine | -1.053097 | -0.000000 | 15.305645 |
| WindSpeed3pm Rainfall | -1.131560 | -0.000000 | -19.222073 |
| WindDir3pm^2 | 0.828722 | -0.000000 | 4.630299 |
| WindDir3pm Sunshine | 0.343187 | 0.000000 | 0.635540 |
| WindDir3pm Rainfall | -0.973821 | -0.000000 | 20.616689 |
| Sunshine^2 | 7.080499 | 11.676768 | 13.589054 |
| Sunshine Rainfall | -2.841213 | -0.000000 | -5.688230 |
| Rainfall^2 | -2.574652 | -0.000000 | -22.396915 |
Note how the Ridge regularization shrank the coefficients toward zero without forcing them exactly to zero, as the Lasso regularization does.
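The contrast can be sketched on synthetic data: Ridge coefficients shrink smoothly toward zero as $\alpha$ grows, but none of them ever hit exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, -0.5]) + rng.normal(scale=0.5, size=200)

norms = {}
for alpha in (0.1, 10.0, 1000.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    norms[alpha] = float(np.linalg.norm(ridge.coef_))  # overall coefficient size
    print(f"alpha={alpha}: ||coef||_2 = {norms[alpha]:.3f}, "
          f"nonzero coefficients: {np.count_nonzero(ridge.coef_)}/5")
```

Even at alpha=1000 all five coefficients remain (numerically) nonzero; only their magnitudes shrink.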
Let's evaluate our new model.
# Predict:
predictions = ridge_model.predict(X_test_poly)
# Compute the MAE, the MSE and the R^2 on the test set
mae_ridge = mean_absolute_error(y_test, predictions)
mse_ridge = mean_squared_error(y_test, predictions)
r2_ridge = r2_score(y_test, predictions)
# Compute the MAE, the MSE and the R^2 on the training set
predictions_train = ridge_model.predict(X_train_poly)
mae_train_ridge = mean_absolute_error(y_train, predictions_train)
mse_train_ridge = mean_squared_error(y_train, predictions_train)
r2_train_ridge = r2_score(y_train, predictions_train)
print(f"MAE test set: {mae_ridge:0.2f}; MAE training set: {mae_train_ridge:0.2f};")
print(f"MSE test set: {mse_ridge:0.2f}; MSE training set: {mse_train_ridge:0.2f};")
print(f"R\u00b2 test set: {r2_ridge:0.2f}; R\u00b2 training set: {r2_train_ridge:0.2f};" )
MAE test set: 3.00; MAE training set: 3.19; MSE test set: 14.00; MSE training set: 15.38; R² test set: 0.64; R² training set: 0.66;
Let's visualize the MAE and MSE on the test data obtained in our different models:
model_comparison = pd.DataFrame([mae, mse], index=['MAE', 'MSE'], columns=['Polynomial model'])
model_comparison['LASSO']=[mae_lasso, mse_lasso]
model_comparison['Ridge']=[mae_ridge, mse_ridge]
model_comparison
| Polynomial model | LASSO | Ridge | |
|---|---|---|---|
| MAE | 2.943465 | 4.070092 | 2.997895 |
| MSE | 13.838756 | 24.208311 | 13.999620 |
The polynomial and Ridge models seem to perform similarly.
However, note that the regularization parameter $\alpha$ has a large impact on MAE and MSE in the test data. Moreover, the relationship between the test data MSE and $\alpha$ is complicated and non-monotonic. Hence, one popular method for choosing the regularization parameter is cross-validation, which we will implement below.
K-fold cross validation¶
Roughly speaking, cross-validation splits the training dataset into many training/testing subsets, then chooses the regularization parameter value that minimizes the average MSE.
More precisely, k-fold cross-validation does the following:
- Partition the dataset randomly into k subsets/”folds”.
- Compute $MSE_j(\alpha)$, the mean squared error on the $j$-th subset, using the $j$-th subset as test data and the other $k-1$ subsets as training data.
- Minimize average (across folds) MSE $\min_\alpha \frac{1}{k}\sum_{j=1}^k MSE_j(\alpha)$.
You can find a more detailed description of cross-validation here.
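The three steps above can be sketched directly with sklearn's KFold (self-contained, on synthetic data; in practice the RidgeCV/LassoCV modules below automate this for us):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=1.0, size=150)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # Step 1: partition into k=5 folds

avg_mse = {}
for alpha in alphas:
    fold_mse = []
    # Step 2: each fold serves once as test data, the other k-1 as training data
    for train_idx, test_idx in kf.split(X):
        model = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])
        fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    avg_mse[alpha] = float(np.mean(fold_mse))  # Step 3: average across folds

best_alpha = min(avg_mse, key=avg_mse.get)  # minimize the average MSE over alpha
print(f"best alpha: {best_alpha}")
```
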
We will now implement cross-validation on top of our previous polynomial linear regression with Ridge regularization, using the sklearn RidgeCV module (Documentation). Here is the import line:
from sklearn.linear_model import RidgeCV
Incidentally, a similar module exists for Lasso regularization with cross-validation, namely LassoCV (Documentation).
With the argument cv, we can specify the number of folds:
# Set up the model
# (RidgeCV searches over alphas=(0.1, 1.0, 10.0) by default; the selected
#  value is available as ridge_cv_model.alpha_ after fitting)
ridge_cv_model = RidgeCV(cv=5, fit_intercept=False)
# Use fit
ridge_cv_model.fit(X_train_poly, y_train)
# Check the score/accuracy
print("R\u00b2 Score of the model: ", round(ridge_cv_model.score(X_train_poly, y_train), 3))
# Print the coefficients of the linear model
model_coeff['Coefficients Ridge-CV model']=ridge_cv_model.coef_.flatten()
model_coeff
R² Score of the model: 0.701
| Coefficients Ridge model | Coefficients Lasso model | Coefficients polynomial model | Coefficients Ridge-CV model | |
|---|---|---|---|---|
| 1 | 21.167430 | 16.480539 | 70.818534 | 29.980247 |
| Humidity3pm | -3.083184 | -3.131065 | -34.497306 | -8.429711 |
| Cloud3pm | 4.714199 | 2.723063 | -14.110065 | 2.558045 |
| Pressure3pm | -1.346583 | -0.000000 | -58.905587 | -8.230082 |
| WindSpeed3pm | -3.411386 | -2.157586 | -44.953044 | -13.383001 |
| WindDir3pm | 2.601451 | 0.000000 | -16.625414 | -0.909233 |
| Sunshine | 8.691005 | 0.000000 | -18.112180 | 11.709118 |
| Rainfall | 0.266296 | -0.000000 | -2.382792 | 0.906458 |
| Humidity3pm^2 | 0.421859 | -0.000000 | 5.297209 | 2.632328 |
| Humidity3pm Cloud3pm | 0.046893 | -0.000000 | 3.811526 | 0.296088 |
| Humidity3pm Pressure3pm | -1.658249 | -0.000000 | 28.136987 | 5.541463 |
| Humidity3pm WindSpeed3pm | -4.754987 | -0.000000 | 12.537486 | -1.600918 |
| Humidity3pm WindDir3pm | -7.051185 | -0.000000 | -8.755339 | -10.940408 |
| Humidity3pm Sunshine | -8.350637 | -0.000000 | -7.294708 | -13.265509 |
| Humidity3pm Rainfall | 1.697219 | -0.000000 | 8.673520 | 3.984783 |
| Cloud3pm^2 | -0.601082 | 0.000000 | -1.263658 | -3.654692 |
| Cloud3pm Pressure3pm | -0.183609 | -0.000000 | 6.184263 | -0.972508 |
| Cloud3pm WindSpeed3pm | -0.052566 | -0.000000 | 11.543304 | 4.504809 |
| Cloud3pm WindDir3pm | -0.775338 | 0.000000 | 1.151535 | -1.389989 |
| Cloud3pm Sunshine | 0.226056 | 0.000000 | 14.325605 | 5.319233 |
| Cloud3pm Rainfall | 1.919206 | -0.000000 | 11.561408 | 8.245063 |
| Pressure3pm^2 | -5.906742 | -3.868799 | 10.339247 | -4.601638 |
| Pressure3pm WindSpeed3pm | -4.299129 | -0.000000 | 16.659560 | 0.716887 |
| Pressure3pm WindDir3pm | -2.454472 | -0.000000 | 18.404142 | 4.675586 |
| Pressure3pm Sunshine | -7.136742 | -0.000000 | -1.986992 | -17.329171 |
| Pressure3pm Rainfall | 0.211044 | -0.000000 | -0.245664 | -0.476549 |
| WindSpeed3pm^2 | -1.303169 | -0.000000 | 5.415717 | 0.202900 |
| WindSpeed3pm WindDir3pm | -4.157581 | -0.000000 | 0.445015 | -3.559479 |
| WindSpeed3pm Sunshine | -1.053097 | -0.000000 | 15.305645 | 1.017390 |
| WindSpeed3pm Rainfall | -1.131560 | -0.000000 | -19.222073 | -4.157175 |
| WindDir3pm^2 | 0.828722 | -0.000000 | 4.630299 | 2.909522 |
| WindDir3pm Sunshine | 0.343187 | 0.000000 | 0.635540 | -3.321442 |
| WindDir3pm Rainfall | -0.973821 | -0.000000 | 20.616689 | 4.404557 |
| Sunshine^2 | 7.080499 | 11.676768 | 13.589054 | 6.678826 |
| Sunshine Rainfall | -2.841213 | -0.000000 | -5.688230 | -6.199071 |
| Rainfall^2 | -2.574652 | -0.000000 | -22.396915 | -11.713438 |
As always, let's evaluate our model:
# Predict:
predictions = ridge_cv_model.predict(X_test_poly)
# Compute the MAE, the MSE and the R^2 on the test set
mae_ridge_cv = mean_absolute_error(y_test, predictions)
mse_ridge_cv = mean_squared_error(y_test, predictions)
r2_ridge_cv = r2_score(y_test, predictions)
# Compute the MAE, the MSE and the R^2 on the training set
predictions_train = ridge_cv_model.predict(X_train_poly)
mae_train_ridge_cv = mean_absolute_error(y_train, predictions_train)
mse_train_ridge_cv = mean_squared_error(y_train, predictions_train)
r2_train_ridge_cv = r2_score(y_train, predictions_train)
print(f"MAE test set: {mae_ridge_cv:0.2f}; MAE training set: {mae_train_ridge_cv:0.2f};")
print(f"MSE test set: {mse_ridge_cv:0.2f}; MSE training set: {mse_train_ridge_cv:0.2f};")
print(f"R\u00b2 test set: {r2_ridge_cv:0.2f}; R\u00b2 training set: {r2_train_ridge_cv:0.2f};" )
Let's compare our model to the previous ones:
model_comparison['Ridge with Cross Validation']=[mae_ridge_cv, mse_ridge_cv]
model_comparison
| Polynomial model | LASSO | Ridge | Ridge with Cross Validation | |
|---|---|---|---|---|
| MAE | 2.943465 | 4.070092 | 2.997895 | 2.899290 |
| MSE | 13.838756 | 24.208311 | 13.999620 | 13.345064 |
By optimizing the regularization parameter, the MAE and MSE decreased a little.
Assignment Solution¶
In this task, you will study the energy efficiency of buildings. More precisely, you will try to predict the heating loads of buildings based on the following features:
- Relative Compactness
- Surface Area
- Wall Area
- Roof Area
- Overall Height
- Orientation
- Galzing Area
- Glazing Area Distribution
You will use the Energy-Efficiency-Dataset, created by Angeliki Xifara, processed by Athanasios Tsanas, and made available on OpenML, an open platform for sharing datasets, algorithms, and experiments.
Let's load the dataset!
df = pd.read_csv('energy-efficiency.csv', index_col=0)
Discover your dataset
- Explore your dataset, displaying a few observations, the types of your data, some summary statistics, the correlation matrix and so on.
# YOUR CODE HERE
# Display a sample of the data
display(df.head())
# Display the data types
display(df.dtypes)
# Summary statistics
display(df.describe())
# Correlation matrix
df.corr()
| Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Orientation | Galzing Area | Glazing Area Distribution | Heating Load | Cooling Load | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 2.0 | 0.0 | 0.0 | 15.55 | 21.33 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 3.0 | 0.0 | 0.0 | 15.55 | 21.33 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 4.0 | 0.0 | 0.0 | 15.55 | 21.33 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 5.0 | 0.0 | 0.0 | 15.55 | 21.33 |
| 4 | 0.90 | 563.5 | 318.5 | 122.50 | 7.0 | 2.0 | 0.0 | 0.0 | 20.84 | 28.28 |
Relative Compactness float64 Surface Area float64 Wall Area float64 Roof Area float64 Overall Height float64 Orientation float64 Galzing Area float64 Glazing Area Distribution float64 Heating Load float64 Cooling Load float64 dtype: object
| Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Orientation | Galzing Area | Glazing Area Distribution | Heating Load | Cooling Load | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.00000 | 768.000000 | 768.000000 | 768.00000 | 768.000000 | 768.000000 |
| mean | 0.764167 | 671.708333 | 318.500000 | 176.604167 | 5.25000 | 3.500000 | 0.234375 | 2.81250 | 22.307201 | 24.587760 |
| std | 0.105777 | 88.086116 | 43.626481 | 45.165950 | 1.75114 | 1.118763 | 0.133221 | 1.55096 | 10.090196 | 9.513306 |
| min | 0.620000 | 514.500000 | 245.000000 | 110.250000 | 3.50000 | 2.000000 | 0.000000 | 0.00000 | 6.010000 | 10.900000 |
| 25% | 0.682500 | 606.375000 | 294.000000 | 140.875000 | 3.50000 | 2.750000 | 0.100000 | 1.75000 | 12.992500 | 15.620000 |
| 50% | 0.750000 | 673.750000 | 318.500000 | 183.750000 | 5.25000 | 3.500000 | 0.250000 | 3.00000 | 18.950000 | 22.080000 |
| 75% | 0.830000 | 741.125000 | 343.000000 | 220.500000 | 7.00000 | 4.250000 | 0.400000 | 4.00000 | 31.667500 | 33.132500 |
| max | 0.980000 | 808.500000 | 416.500000 | 220.500000 | 7.00000 | 5.000000 | 0.400000 | 5.00000 | 43.100000 | 48.030000 |
| Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Orientation | Galzing Area | Glazing Area Distribution | Heating Load | Cooling Load | |
|---|---|---|---|---|---|---|---|---|---|---|
| Relative Compactness | 1.000000e+00 | -9.919015e-01 | -2.037817e-01 | -8.688234e-01 | 8.277473e-01 | 4.678592e-17 | -2.960552e-15 | -7.107006e-16 | 0.622272 | 0.634339 |
| Surface Area | -9.919015e-01 | 1.000000e+00 | 1.955016e-01 | 8.807195e-01 | -8.581477e-01 | -3.459372e-17 | 3.636925e-15 | 2.438409e-15 | -0.658120 | -0.672999 |
| Wall Area | -2.037817e-01 | 1.955016e-01 | 1.000000e+00 | -2.923165e-01 | 2.809757e-01 | -2.429499e-17 | -8.567455e-17 | 2.067384e-16 | 0.455671 | 0.427117 |
| Roof Area | -8.688234e-01 | 8.807195e-01 | -2.923165e-01 | 1.000000e+00 | -9.725122e-01 | -5.830058e-17 | -1.759011e-15 | -1.078071e-15 | -0.861828 | -0.862547 |
| Overall Height | 8.277473e-01 | -8.581477e-01 | 2.809757e-01 | -9.725122e-01 | 1.000000e+00 | 4.492205e-17 | 1.489134e-17 | -2.920613e-17 | 0.889431 | 0.895785 |
| Orientation | 4.678592e-17 | -3.459372e-17 | -2.429499e-17 | -5.830058e-17 | 4.492205e-17 | 1.000000e+00 | -9.406007e-16 | -2.549352e-16 | -0.002587 | 0.014290 |
| Galzing Area | -2.960552e-15 | 3.636925e-15 | -8.567455e-17 | -1.759011e-15 | 1.489134e-17 | -9.406007e-16 | 1.000000e+00 | 2.129642e-01 | 0.269841 | 0.207505 |
| Glazing Area Distribution | -7.107006e-16 | 2.438409e-15 | 2.067384e-16 | -1.078071e-15 | -2.920613e-17 | -2.549352e-16 | 2.129642e-01 | 1.000000e+00 | 0.087368 | 0.050525 |
| Heating Load | 6.222722e-01 | -6.581202e-01 | 4.556712e-01 | -8.618283e-01 | 8.894307e-01 | -2.586534e-03 | 2.698410e-01 | 8.736759e-02 | 1.000000 | 0.975862 |
| Cooling Load | 6.343391e-01 | -6.729989e-01 | 4.271170e-01 | -8.625466e-01 | 8.957852e-01 | 1.428960e-02 | 2.075050e-01 | 5.052512e-02 | 0.975862 | 1.000000 |
We visualize our correlation matrix with a heatmap, using the seaborn library:
sns.heatmap(df.corr().round(decimals=2), annot=True, cmap="bwr")
plt.show()
Let's draw boxplots. Since the features have very different scales, we draw a separate boxplot for each:
df.plot(
kind='box',
subplots=True,
sharey=False,
figsize=(15, 5)
)
# increase spacing between subplots
plt.subplots_adjust(wspace=1)
plt.show()
Let's do a pairplot:
sns.pairplot(df)
plt.show()
Multivariate linear regression
You will first implement a multivariate linear regression, using all the features listed in the assignment to predict the heating load.
- Select your features and split your dataset into training and test sets, using an 80-20 split
# YOUR CODE HERE
# We select our features:
X = df[['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height', 'Orientation', 'Galzing Area' , 'Glazing Area Distribution']]
y = df[['Heating Load']]
# We split our dataset:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=True)
- Rescale your features
# YOUR CODE HERE
# Select scaler:
scaler = MinMaxScaler()
# Fit scaler and transform training set
X_train = scaler.fit_transform(X_train)
# Transform test set
X_test = scaler.transform(X_test)
- Create a linear regression model and train it.
# YOUR CODE HERE
# Create model
model = LinearRegression()
# Train model
model.fit(X_train, y_train)
LinearRegression()
- What is the $R^2$ of your model?
- Display a dataframe with the coefficients of your model. Which ones are relatively more important?
# YOUR CODE HERE
# Compute and display R^2
print("R\u00b2 Score of the model: ", round(model.score(X_train, y_train), 3))
# Print intercept
print("Intercept: ", model.intercept_[0])
# Feature names
features = ['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height', 'Orientation', 'Galzing Area' , 'Glazing Area Distribution']
# Create dataframe with feature names and associated coefficients
model_coeff = pd.DataFrame(model.coef_.flatten(),
index=features,
columns=['Coefficients multivariate model'])
model_coeff
R² Score of the model:  0.918
Intercept:  28472140757642.45
| Coefficients multivariate model | |
|---|---|
| Relative Compactness | -2.203722e+01 |
| Surface Area | 1.708328e+14 |
| Wall Area | -9.965249e+13 |
| Roof Area | -1.281246e+14 |
| Overall Height | 1.407812e+01 |
| Orientation | -5.114746e-02 |
| Galzing Area | 7.898834e+00 |
| Glazing Area Distribution | 8.134766e-01 |
- What are the MAE, MSE, and $R^2$ on the test data? How do they compare to the same metrics on the training data?
We will write a function to automate the procedure since we will perform the same operations for several models. In addition, we will store the MAE and MSE in a dataframe to compare their values with other models at the end.
def errors_model(X_test, y_test, X_train, y_train, model, model_name):
"""This function computes the MAE, MSE, R^2 of a predictive model, both on the test set and on the training set."""
# Predict on test set
predictions = model.predict(X_test)
# Compute the MAE, the MSE and the R^2 on the test set
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
# Compute the MAE, the MSE and the R^2 on the training set
predictions_train = model.predict(X_train)
mae_train = mean_absolute_error(y_train, predictions_train)
mse_train = mean_squared_error(y_train, predictions_train)
r2_train = r2_score(y_train, predictions_train)
# Print results
print(f"MAE test set: {mae:0.2f}; MAE training set: {mae_train:0.2f};")
print(f"MSE test set: {mse:0.2f}; MSE training set: {mse_train:0.2f};")
print(f"R\u00b2 test set: {r2:0.2f}; R\u00b2 training set: {r2_train:0.2f};" )
# We also create a dataframe with the results MAE and MSE for the test set
df_errors = pd.DataFrame([mae, mse], index=['MAE', 'MSE'], columns=[model_name])
return df_errors
errors_multivariate = errors_model(X_test, y_test, X_train, y_train, model, 'Multivariate')
MAE test set: 2.23; MAE training set: 2.02; MSE test set: 10.04; MSE training set: 8.15; R² test set: 0.91; R² training set: 0.92;
The MAE and MSE on the test set are slightly higher than on the training set, but comparable, and the $R^2$ values are close, so the model seems to generalize well. Note also the enormous intercept and the Surface Area, Wall Area, and Roof Area coefficients (on the order of $10^{14}$): these three areas are nearly linearly dependent, so their individual coefficients are numerically unstable and not interpretable.
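The collinearity problem can be reproduced in isolation (synthetic data, not the building dataset): with two nearly identical columns, ordinary least squares can produce huge, offsetting coefficients, while even a small Ridge penalty splits the weight between them stably.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)      # almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)   # true effect carried jointly by x1 and x2

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("OLS coefficients:  ", ols.coef_)    # huge, offsetting values
print("Ridge coefficients:", ridge.coef_)  # roughly (0.5, 0.5)
```

Only the sum of the two coefficients is well determined by the data; the split between them is essentially arbitrary for OLS, which is why the individual values explode.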
Polynomial linear regression
- Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to 2
# YOUR CODE HERE
# Create polynomial model with 2 degrees
poly = PolynomialFeatures(2)
# Fit model and transform train set
X_train_poly = poly.fit_transform(X_train)
# Transform test set
X_test_poly = poly.transform(X_test)
# Get new feature names
X_poly_features = poly.get_feature_names_out(features)
print(X_poly_features)
['1' 'Relative Compactness' 'Surface Area' 'Wall Area' 'Roof Area' 'Overall Height' 'Orientation' 'Galzing Area' 'Glazing Area Distribution' 'Relative Compactness^2' 'Relative Compactness Surface Area' 'Relative Compactness Wall Area' 'Relative Compactness Roof Area' 'Relative Compactness Overall Height' 'Relative Compactness Orientation' 'Relative Compactness Galzing Area' 'Relative Compactness Glazing Area Distribution' 'Surface Area^2' 'Surface Area Wall Area' 'Surface Area Roof Area' 'Surface Area Overall Height' 'Surface Area Orientation' 'Surface Area Galzing Area' 'Surface Area Glazing Area Distribution' 'Wall Area^2' 'Wall Area Roof Area' 'Wall Area Overall Height' 'Wall Area Orientation' 'Wall Area Galzing Area' 'Wall Area Glazing Area Distribution' 'Roof Area^2' 'Roof Area Overall Height' 'Roof Area Orientation' 'Roof Area Galzing Area' 'Roof Area Glazing Area Distribution' 'Overall Height^2' 'Overall Height Orientation' 'Overall Height Galzing Area' 'Overall Height Glazing Area Distribution' 'Orientation^2' 'Orientation Galzing Area' 'Orientation Glazing Area Distribution' 'Galzing Area^2' 'Galzing Area Glazing Area Distribution' 'Glazing Area Distribution^2']
- Train a linear regression model with polynomial features
- What is the $R^2$
# YOUR CODE HERE
# Set up the model
model_poly = LinearRegression(fit_intercept=False)  # no intercept needed: PolynomialFeatures already adds a column of ones
# Fit
model_poly.fit(X_train_poly, y_train)
# Check the score/accuracy
print("R\u00b2 Score of the model: ", round(model_poly.score(X_train_poly, y_train), 3))
R² Score of the model: 0.995
- What are the MAE, MSE, and $R^2$ on the test data? How do they compare to the same metrics on the training data?
We can use our previously defined function!
# YOUR CODE HERE
errors_polynomial = errors_model(X_test_poly, y_test, X_train_poly, y_train, model_poly, 'Polynomial')
MAE test set: 0.62; MAE training set: 0.52; MSE test set: 0.68; MSE training set: 0.48; R² test set: 0.99; R² training set: 1.00;
The training-set $R^2$ of 1.00 (after rounding) looks suspicious, since it suggests a perfect fit. Could it be overfitting? The errors are quite low and similar between the training and test sets, so there is no sign of overfitting: we have designed a good predictive model.
Regularization
- Train a linear regression model with polynomial features and ridge regression
- What are the MAE, MSE, and $R^2$ on the test data?
# YOUR CODE HERE
# Create and Fit model, all in one line!
ridge_model = Ridge(alpha=1.0, fit_intercept=False).fit(X_train_poly, y_train)
# Compute errors using our function:
errors_ridge = errors_model(X_test_poly, y_test, X_train_poly, y_train, ridge_model, 'Ridge')
MAE test set: 1.94; MAE training set: 1.70; MSE test set: 7.68; MSE training set: 6.06; R² test set: 0.93; R² training set: 0.94;
- Train a linear regression model with polynomial features and lasso regression
- What are the MAE, MSE, and $R^2$ on the test data?
# YOUR CODE HERE
# Create and Fit model, all in one line!
lasso_model = Lasso(alpha=1, fit_intercept=False).fit(X_train_poly, y_train)
# Compute errors using our function:
errors_lasso = errors_model(X_test_poly, y_test, X_train_poly, y_train, lasso_model, 'LASSO')
MAE test set: 3.30; MAE training set: 2.89; MSE test set: 21.67; MSE training set: 16.49; R² test set: 0.80; R² training set: 0.83;
Cross-validation
- Train a linear regression model with polynomial features, cross-validation, and the regularization technique of your choice (you can even explore other ones, such as Elastic Net)
- What are the MAE, MSE, and $R^2$ on the test data?
We will use Lasso regularization with cross-validation (LassoCV) to try a technique we have not implemented before. Note that LassoCV expects a one-dimensional target array, so we flatten y with .values.flatten(); here is what the flattened test target looks like:
y_test.values.flatten()
array([15.18, 10.32, 37.26, 16.95, 32.26, 27.9 , 28.18, 28.95, 29.07,
23.8 , 6.4 , 42.5 , 11.22, 43.1 , 41.96, 26.33, 10.7 , 28.09,
14.65, 12.29, 12.46, 32.71, 10.77, 38.57, 6.04, 14.66, 13. ,
14.41, 10.75, 39.89, 12.74, 12.74, 41.26, 12.77, 28.56, 35.99,
13.97, 35.69, 17.14, 10.85, 11.6 , 39.97, 14.42, 25.17, 19.5 ,
24.4 , 13.69, 12.12, 28.86, 32.75, 32.24, 12.97, 32.31, 14.96,
36.95, 10.39, 15.55, 26.19, 40.12, 14.53, 14.72, 12.33, 32.23,
10.34, 12.65, 11.8 , 38.98, 11.21, 25.36, 39.83, 32.82, 10.53,
23.89, 17.26, 15.16, 24.94, 32.85, 38.65, 32.84, 14.51, 35.69,
29.62, 11.69, 14.16, 29.92, 25.38, 24.03, 29.63, 12.85, 17.15,
14.18, 26.45, 12.5 , 24.63, 40.78, 15.09, 29.47, 14.75, 7.1 ,
35.64, 36.97, 22.93, 19.34, 32.96, 10.39, 29.88, 42.11, 12.92,
12.57, 15.55, 17.69, 33.09, 32.13, 15.36, 12.19, 15.36, 11.34,
36.77, 26.45, 38.82, 35.01, 12.16, 12.57, 36.06, 12.28, 40.6 ,
14.32, 28.17, 12.32, 32.31, 32.21, 24.63, 35.45, 12.03, 36.7 ,
7.1 , 29.52, 24.58, 33.21, 14.03, 17.37, 19.52, 29.06, 10.71,
35.56, 16.74, 29.03, 36.91, 12.34, 14.33, 28.15, 11.33, 13.86,
14.34])
# YOUR CODE HERE
# Create and fit model, with 5 folds:
lasso_cv_model = LassoCV(cv=5, fit_intercept=False).fit(X_train_poly, y_train.values.flatten())
# Errors:
errors_cv = errors_model(X_test_poly, y_test.values.flatten(), X_train_poly, y_train.values.flatten(), lasso_cv_model, 'LASSO with Cross Validation')
MAE test set: 1.97; MAE training set: 1.79; MSE test set: 8.59; MSE training set: 6.85; R² test set: 0.92; R² training set: 0.93;
Model comparison
- Compare your models. Which one gives the most accurate prediction?
- Display the coefficients of your best model. What do you observe?
We will concatenate our previously obtained dataframes:
# YOUR CODE HERE
# compile the list of dataframes you want to merge
df_errors_list = [errors_multivariate, errors_polynomial, errors_ridge, errors_lasso, errors_cv]
model_comparison = pd.concat(df_errors_list, axis=1)
model_comparison
| Multivariate | Polynomial | Ridge | LASSO | LASSO with Cross Validation | |
|---|---|---|---|---|---|
| MAE | 2.228756 | 0.620909 | 1.944348 | 3.298777 | 1.973989 |
| MSE | 10.040484 | 0.680979 | 7.683001 | 21.672429 | 8.586230 |
We had already observed that the Polynomial model led to very good predictions, so regularization was probably not needed in our case (but it's always instructive to explore several models). Ridge performs better than LASSO. When we optimize the LASSO penalty term with cross-validation, we obtain results similar to Ridge.
Note: We did not use the $R^2$ because our models have a different number of features, hence comparing $R^2$ directly would be misleading. Instead, we could have computed the adjusted $R^2$.
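The adjusted $R^2$ penalizes $R^2$ for the number of features used. As a small illustrative sketch (the sample and feature counts below are made up, not taken from our dataset):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2: penalizes R^2 for using p features with n samples."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same raw R^2 looks less impressive once many features are used:
print(adjusted_r2(0.92, n=100, p=2))   # ~0.918
print(adjusted_r2(0.92, n=100, p=45))  # ~0.853
```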
Let's print the coefficients of our polynomial model:
model_coeff = pd.DataFrame(model_poly.coef_.flatten(),
index=X_poly_features,
columns=['Coefficients polynomial model'])
model_coeff
| Coefficients polynomial model | |
|---|---|
| 1 | 4.605692e+11 |
| Relative Compactness | 1.421639e+10 |
| Surface Area | -9.310856e+10 |
| Wall Area | 6.451567e+10 |
| Roof Area | -3.816844e+11 |
| Overall Height | -2.240243e+11 |
| Orientation | -7.776527e-01 |
| Galzing Area | 2.175857e+01 |
| Glazing Area Distribution | 9.577843e+00 |
| Relative Compactness^2 | -2.187030e+04 |
| Relative Compactness Surface Area | -7.330324e+09 |
| Relative Compactness Wall Area | 4.276002e+09 |
| Relative Compactness Roof Area | -9.940348e+09 |
| Relative Compactness Overall Height | -1.029204e+10 |
| Relative Compactness Orientation | 1.074892e+00 |
| Relative Compactness Galzing Area | -2.278633e-01 |
| Relative Compactness Glazing Area Distribution | -1.636900e+00 |
| Surface Area^2 | 1.748523e+10 |
| Surface Area Wall Area | -2.222424e+10 |
| Surface Area Roof Area | 1.990837e+10 |
| Surface Area Overall Height | 2.080261e+11 |
| Surface Area Orientation | 4.122429e-01 |
| Surface Area Galzing Area | -6.643944e+00 |
| Surface Area Glazing Area Distribution | -2.476692e+00 |
| Wall Area^2 | 7.014303e+09 |
| Wall Area Roof Area | -2.075108e+10 |
| Wall Area Overall Height | -1.283528e+11 |
| Wall Area Orientation | 5.625706e-01 |
| Wall Area Galzing Area | 4.015547e+00 |
| Wall Area Glazing Area Distribution | 6.690845e-01 |
| Roof Area^2 | -4.213454e+10 |
| Roof Area Overall Height | 3.274149e+11 |
| Roof Area Orientation | 2.136078e-01 |
| Roof Area Galzing Area | -6.936222e+00 |
| Roof Area Glazing Area Distribution | -1.945717e+00 |
| Overall Height^2 | -2.240243e+11 |
| Overall Height Orientation | -3.270512e-01 |
| Overall Height Galzing Area | -4.985657e+00 |
| Overall Height Glazing Area Distribution | -1.251072e+00 |
| Orientation^2 | -3.299675e-01 |
| Orientation Galzing Area | 2.781296e-02 |
| Orientation Glazing Area Distribution | 2.947769e-01 |
| Galzing Area^2 | -3.087727e+00 |
| Galzing Area Glazing Area Distribution | -3.680084e+00 |
| Glazing Area Distribution^2 | -3.295742e+00 |
We obtained extremely high and low values, which we cannot really interpret. There are several possible explanations:
- the features might not be properly scaled: here, this is not the issue, since our scaling made the features comparable.
- overfitting: we compared the performance of our model on the training and test sets and obtained good performance for both, so overfitting does not seem to be the issue.
- collinearity between features: this is it! If we check our correlation matrix, we see that we have a correlation of almost 1 between X1 and X2.
So, is it an issue? Well, it depends. If all we care about is good prediction, it is not critical, although we need to be careful that our model can be generalized to new data.
However, we may care about interpretability, for instance if we want to gain insight into the underlying factors driving our predictions, or if we want to communicate our results in an understandable way. What do we do in this case? We have several options:
- either remove one of the correlated features or combine them into a single feature
- implement regularization (as we did) to shrink coefficients (Ridge) or select features (LASSO)
- explore alternative models
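Removing one of two highly correlated features presupposes that we can detect such pairs. A minimal sketch on synthetic data (the feature names, the noise level, and the 0.95 threshold are arbitrary choices, not from our dataset):

```python
import numpy as np
import pandas as pd

# Synthetic features: x2 is (almost) a linear function of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df_toy = pd.DataFrame({
    'x1': x1,
    'x2': -x1 + rng.normal(scale=0.05, size=200),
    'x3': rng.normal(size=200),
})

# Flag feature pairs whose absolute correlation exceeds a threshold
corr = df_toy.corr()
threshold = 0.95
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and abs(corr.loc[a, b]) > threshold]
print(pairs)  # [('x1', 'x2')]
```

Any flagged pair is a candidate for removal, combination, or regularization as discussed above.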
Let's see how regularization modified our coefficients:
model_coeff['Coefficients RIDGE model'] = ridge_model.coef_.flatten()
model_coeff['Coefficients LASSO model'] = lasso_model.coef_.flatten()
model_coeff['Coefficients LASSO-CV model'] = lasso_cv_model.coef_.flatten()
model_coeff
| Coefficients polynomial model | Coefficients RIDGE model | Coefficients LASSO model | Coefficients LASSO-CV model | |
|---|---|---|---|---|
| 1 | 4.605692e+11 | 2.826509 | 10.802822 | 7.180690 |
| Relative Compactness | 1.421639e+10 | 2.557123 | 0.000000 | 0.000000 |
| Surface Area | -9.310856e+10 | 2.249243 | 0.000000 | 0.000000 |
| Wall Area | 6.451567e+10 | 3.553134 | 0.151052 | 2.431144 |
| Roof Area | -3.816844e+11 | 0.863554 | 0.000000 | 0.000000 |
| Overall Height | -2.240243e+11 | 3.853848 | 15.739066 | 10.912068 |
| Orientation | -7.776527e-01 | -0.666793 | 0.000000 | 0.000000 |
| Galzing Area | 2.175857e+01 | 5.471534 | 4.349949 | 6.962932 |
| Glazing Area Distribution | 9.577843e+00 | 2.700909 | 0.000000 | 1.915740 |
| Relative Compactness^2 | -2.187030e+04 | 1.800994 | 0.000000 | -0.000000 |
| Relative Compactness Surface Area | -7.330324e+09 | 2.061941 | 0.000000 | 0.000000 |
| Relative Compactness Wall Area | 4.276002e+09 | 4.623310 | 0.000000 | 0.000000 |
| Relative Compactness Roof Area | -9.940348e+09 | -0.278403 | 0.000000 | 0.000000 |
| Relative Compactness Overall Height | -1.029204e+10 | -0.447347 | 0.000000 | -0.000000 |
| Relative Compactness Orientation | 1.074892e+00 | 0.144284 | 0.000000 | -0.000000 |
| Relative Compactness Galzing Area | -2.278633e-01 | 5.811772 | 0.000000 | 0.000000 |
| Relative Compactness Glazing Area Distribution | -1.636900e+00 | 2.668351 | 0.000000 | 0.000000 |
| Surface Area^2 | 1.748523e+10 | 0.409940 | 0.000000 | 0.000000 |
| Surface Area Wall Area | -2.222424e+10 | -0.287847 | 0.000000 | 0.000000 |
| Surface Area Roof Area | 1.990837e+10 | 1.270299 | 0.000000 | 0.000000 |
| Surface Area Overall Height | 2.080261e+11 | 6.418693 | 0.000000 | 0.667460 |
| Surface Area Orientation | 4.122429e-01 | 0.303049 | 0.000000 | 0.000000 |
| Surface Area Galzing Area | -6.643944e+00 | 1.184801 | 0.000000 | 0.000000 |
| Surface Area Glazing Area Distribution | -2.476692e+00 | 1.150970 | 0.000000 | 0.000000 |
| Wall Area^2 | 7.014303e+09 | -3.980225 | 0.000000 | -0.000000 |
| Wall Area Roof Area | -2.075108e+10 | 3.501519 | 0.000000 | 2.878083 |
| Wall Area Overall Height | -1.283528e+11 | 9.673425 | 0.000000 | 8.507564 |
| Wall Area Orientation | 5.625706e-01 | 0.111906 | 0.000000 | 0.000000 |
| Wall Area Galzing Area | 4.015547e+00 | 3.256192 | 0.000000 | 0.412138 |
| Wall Area Glazing Area Distribution | 6.690845e-01 | 1.565495 | 0.000000 | 0.000000 |
| Roof Area^2 | -4.213454e+10 | -0.837770 | 0.000000 | 0.000000 |
| Roof Area Overall Height | 3.274149e+11 | 1.890893 | 0.000000 | 0.000000 |
| Roof Area Orientation | 2.136078e-01 | 0.168851 | 0.000000 | 0.000000 |
| Roof Area Galzing Area | -6.936222e+00 | 0.263038 | 0.000000 | -0.000000 |
| Roof Area Glazing Area Distribution | -1.945717e+00 | 0.917221 | 0.000000 | 0.000000 |
| Overall Height^2 | -2.240243e+11 | 3.853848 | 0.000000 | 0.000000 |
| Overall Height Orientation | -3.270512e-01 | 0.198285 | 0.000000 | 0.000000 |
| Overall Height Galzing Area | -4.985657e+00 | 2.073087 | 0.000000 | 4.229869 |
| Overall Height Glazing Area Distribution | -1.251072e+00 | 0.665912 | 0.000000 | 0.540700 |
| Orientation^2 | -3.299675e-01 | -0.277550 | 0.000000 | 0.000000 |
| Orientation Galzing Area | 2.781296e-02 | 0.882388 | 0.000000 | 0.000000 |
| Orientation Glazing Area Distribution | 2.947769e-01 | -0.013682 | 0.000000 | 0.000000 |
| Galzing Area^2 | -3.087727e+00 | -1.763214 | 0.000000 | -0.000000 |
| Galzing Area Glazing Area Distribution | -3.680084e+00 | -3.359032 | 0.000000 | -2.680044 |
| Glazing Area Distribution^2 | -3.295742e+00 | -3.119711 | 0.000000 | -0.000000 |
Model Selection KNN (SOLUTION)¶
# load necessary libraries
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
import numpy as np
np.random.seed(1)
%matplotlib inline
kNN: Find the best k?¶
We will again work with the diabetes dataset, which contains patient attributes (e.g. age, glucose, ...) and information on whether the patient is diagnosed with diabetes (0 meaning "no", 1 meaning "yes"). The goal is to learn a model that predicts whether a (new) patient has diabetes given a set of patient attributes. This is a classification task, and you can use the kNN classifier. The kNN classifier has a hyperparameter k; in order to find its optimal value for our target problem, we need to do hyperparameter tuning (model selection). That will be the main goal of the tasks in this exercise.
Prepare the data for learning¶
Get the inputs X and targets y from the dataset. Always leave a portion of the data for testing: the test dataset should not be used for model development or model selection, but should be kept for the performance assessment of the final model. We split the dataset into 90% training and 10% test examples to create the training set and test set.
# get data
df = pd.read_csv('DiabetesDataset.csv')
# keep the patient characteristics as inputs X and the diabetes as target y
X = df.drop(columns=['Diabetes'])
y = df['Diabetes'].values
#split dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
# check out the size of the training and test datasets
print ("Training Set Size:", len(X_train))
print ("Test Set Size:", len(X_test))
Training Set Size: 691 Test Set Size: 77
1. Create a validation set¶
One way to select the best k is to use a validation set (do hyperparameter tuning using a validation set). The validation set is also called a development set.
Obtain a validation set by splitting the previously created training set into two parts: 80% used for training and 20% used for validation.
How many samples are in the three sets now?
# Obtain the validation set
#### START YOUR CODE HERE ####
X_train_val, X_val, y_train_val, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
print ("Training Set Size:", len(X_train_val))
print ("Validation Set Size:", len(X_val))
print ("Test Set Size:", len(X_test))
#### END YOUR CODE HERE ####
Training Set Size: 552 Validation Set Size: 139 Test Set Size: 77
2. Find optimal k using the validation set¶
Use the validation set to estimate the best k for the kNN classifier. Choose the best k from the values from 1 to 100 using the accuracy score. Plot the computed accuracy scores for all considered values of k with a line plot and draw a vertical line at the best k.
# find the best k using a single validation split
from sklearn.neighbors import KNeighborsClassifier
# list to store the scores for the different k values
scores = []
# loop from 1 to 100 to find the best k for k-NN.
for k in range(1,101):
    #### START YOUR CODE HERE ####
    # train the kNN classifier
    knn_val = KNeighborsClassifier(n_neighbors=k).fit(X_train_val, y_train_val)
    # compute the predictions on the validation set
    pred = knn_val.predict(X_val)
    # compute the accuracy score on the validation set
    val_score = knn_val.score(X_val, y_val)
    # add the accuracy score to the scores list
    scores.append(val_score)

# find the k that yields the best accuracy score and the best score
best_k = np.argmax(scores) + 1
best_score = scores[best_k - 1]
# plot the computed accuracy scores for all tested values of k and draw a vertical line at the best k
plt.plot(range(1,101), scores)
plt.title("best k at {} with score of {}".format(best_k, round(best_score,3)))
plt.axvline(x=best_k, c="k", ls="--")
plt.show()
#### END YOUR CODE HERE ####
3. Train a final kNN classifier¶
Train a final classifier using the best k you selected on the validation set and evaluate its performance (accuracy score) on the test set. Print the test accuracy score and compute and display a confusion matrix for the test set.
# Fit the classifier to the training data and evaluate its performance on test set
#### START YOUR SOLUTION HERE ####
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
# compute predictions on the test data
pred = knn.predict(X_test)
# compute and print the accuracy on the test data
score = knn.score(X_test, y_test)
print ("Test Accuracy Score: ", score)
# compute the confusion matrix for the test set
conf = confusion_matrix(y_test, pred)
# plot the confusion matrix for the test set using a heatmap
sns.heatmap(conf,
annot=True,
fmt='d',
cbar=False,
cmap="coolwarm_r",
linewidth=1)
plt.title('Test accuracy score for k={} : {}'.format(best_k, round(score,3)))
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
#### END YOUR SOLUTION HERE ####
Test Accuracy Score: 0.7922077922077922
4. Select best k using Cross-Validation (CV)¶
Instead of using a validation set, use the whole training data and a 10-fold cross-validation to estimate the best k. Choose the best k from the values from 1 to 100. Plot the mean CV accuracy scores for all considered values of k, their standard deviations and a vertical line for the best k.
# cross-validation for each value of k
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
# arrays to store the mean and standard deviation of the cross validation scores for each tested value of k
mean_scores = np.array([])
scores_std = np.array([])
# loop from 1 to 100 to find the best k for kNN
for k in range(1,101):
    #### START YOUR CODE HERE ####
    # create a new kNN model
    knn_cv = KNeighborsClassifier(n_neighbors=k)
    # compute the accuracy scores for each fold of a 10-fold CV
    cv_scores = cross_val_score(knn_cv, X_train, y_train, cv=10)
    # compute the average and standard deviation of the CV scores and add them to their respective arrays
    mean_scores = np.append(mean_scores, np.mean(cv_scores))
    scores_std = np.append(scores_std, np.std(cv_scores))

# find the best k and best score
best_k = np.argmax(mean_scores) + 1
best_score = mean_scores[best_k - 1]
# plot all scores for all tested values of k with their standard deviations and a vertical line that depicts the best k
plt.plot(range(1,101), mean_scores)
plt.title("best k at {} with score of {}".format(best_k, round(best_score,3)))
plt.fill_between(range(1, len(mean_scores) + 1), mean_scores + scores_std, mean_scores - scores_std, alpha=0.15, color='blue')
plt.axvline(x=best_k, c="k", ls="--")
plt.show()
#### END YOUR CODE HERE ####
5. Train a final kNN classifier based on the CV hyperparameter tuning¶
Train a final classifier using the best k you selected in the 10-fold cross-validation and evaluate its performance (accuracy score) on the test set. Print the test accuracy score and compute and display a confusion matrix for the test set.
#### START YOUR SOLUTION HERE ####
# train the final kNN classifier to the training data
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
# compute predictions for the test data
pred = knn.predict(X_test)
# compute and print the accuracy score on the test data
score = knn.score(X_test, y_test)
print ("Test Accuracy Score: ", score)
# compute the confusion matrix for the test set
conf = confusion_matrix(y_test, pred)
# plot the confusion matrix for the test set using a heatmap
sns.heatmap(conf,
annot=True,
fmt='d',
cbar=False,
cmap="coolwarm_r",
linewidth=1)
plt.title('Test accuracy score for k={} : {}'.format(best_k, round(score,3)))
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
#### END YOUR SOLUTION HERE ####
Test Accuracy Score: 0.7922077922077922
6. Select the best hyperparameters using grid search CV¶
# Grid Search - hyperparameter tuning when we have more than one parameter
from sklearn.model_selection import GridSearchCV
### START YOUR SOLUTION HERE ###
# define the grid of the three parameters to test
grid = {'n_neighbors':np.arange(1,100),
'p':np.arange(1,3),
'weights':['uniform','distance']
}
# set up the kNN model and run the grid search CV
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, grid, cv=10)
knn_cv.fit(X_train, y_train)
# print the best hyperparameters and the corresponding averaged CV accuracy score
print("Hyperparameters:", knn_cv.best_params_)
print("CV Mean Accuracy Score:", round(knn_cv.best_score_, 4))
### END YOUR SOLUTION HERE ###
Hyperparameters: {'n_neighbors': 12, 'p': 1, 'weights': 'uniform'}
CV Mean Accuracy Score: 0.7611
7. Fit the model using the selected (best) hyperparameters¶
Train a final classifier using the selected, best hyperparameters from the grid search CV, evaluate its performance (accuracy score) on the test set, and print the accuracy score and the confusion matrix.
#### START YOUR SOLUTION HERE ####
# fit a kNN classifier with the best parameters selected in the grid search CV
knn = KNeighborsClassifier(n_neighbors=12, p=1, weights='uniform')
knn.fit(X_train, y_train)
# compute the predictions on the test data using the trained model
pred = knn.predict(X_test)
# compute the test accuracy score
acc_score = knn.score(X_test, y_test)
print ("Test Accuracy Score: ", acc_score)
# compute the confusion matrix for the test set
conf = confusion_matrix(y_test, pred)
# plot the confusion matrix for the test set using a heatmap
sns.heatmap(conf,
annot=True,
fmt='d',
cbar=False,
cmap="coolwarm_r",
linewidth=1)
plt.title('Test accuracy score for k={} : {}'.format(knn.n_neighbors, round(acc_score,3)))
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
#### END YOUR SOLUTION HERE ####
Test Accuracy Score: 0.7532467532467533
Decision Trees and Random Forest (SOLUTION)¶
Exercise: Decision Trees¶
We are going to use the breast cancer dataset from sklearn where the goal is to classify each sample as malignant or benign (binary classification task) based on features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
Load the libraries¶
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier, plot_tree
%matplotlib inline
np.random.seed(1)
plt.figure(figsize=(30,30))
<Figure size 3000x3000 with 0 Axes>
Load the data¶
# Load data
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
1. Model fitting¶
In this exercise you need to do the following:
Split the data into a training set and a test set, using a test size of 30% of the data.
Train a decision tree classifier on the data and visualize it.
Make predictions for the test set.
Evaluate the model's performance by computing the accuracy score and plotting the confusion matrix.
Hints:¶
Decision Trees: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Tree Plot: https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html
Confusion matrix plot: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay
from sklearn.metrics import ConfusionMatrixDisplay
# Apply a decisiontree classifier to the data and visualize your decision tree
#### START YOUR CODE HERE ####
# Split the data into training and test set
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.3)
# fit model
clf = DecisionTreeClassifier(min_samples_leaf=1, max_depth=None)
clf = clf.fit(trainX, trainy)
# Plot the fitted tree
plot_tree(clf, filled=True, feature_names=list(cancer.feature_names))
plt.show()
# compute predictions for test set
pred = clf.predict(testX)
# Compute the accuracy score
acc_score = accuracy_score(testy, pred)
# Compute the confusion matrix
conf = confusion_matrix(testy, pred)
# Plot the confusion matrix
cm_display = ConfusionMatrixDisplay(conf).plot()
#### END YOUR CODE HERE ###
Tuning tree depth with grid search CV¶
Tune the tree depth parameter using grid search cross-validation. Check out depth values between 1 and 10.
What is the optimal tree depth and its corresponding test accuracy score?
Plot the tree with the optimal depth parameter.
What is the CV accuracy for the best parameter (tree depth)?
# Grid Search - tuning tree depth
from sklearn.model_selection import GridSearchCV
#### START YOUR SOLUTION HERE ####
# Define grid for the parameter to test - max_depth from 1 to 10
grid = {'max_depth':np.arange(1,11)}
# Define and fit model
tree = DecisionTreeClassifier()
# Grid search CV with 5-fold cross validation
tree_cv = GridSearchCV(tree, grid, cv=5)
tree_cv.fit(trainX, trainy)
# Plot the fitted tree
plot_tree(tree_cv.best_estimator_, filled=True, feature_names=list(cancer.feature_names))
plt.show()
# Print results
print("Hyperparameters (best max_depth):", tree_cv.best_params_)
print("Training CV Accuracy Score:", round(tree_cv.best_score_, 4))
print("Test Accuracy Score:", round(tree_cv.score(testX, testy), 4))
#### END YOUR SOLUTION HERE ####
Hyperparameters (best max_depth): {'max_depth': 4}
Training CV Accuracy Score: 0.9423
Test Accuracy Score: 0.9357
Exercise: Random Forest¶
Now we train a random forest model on the same dataset (for the same task) using the same training/test split.
- Apply a random forest classifier with 100 trees to the data.
- Compute and print the training and test accuracies and compare them to the out-of-bag score (hint: set oob_score=True in the classifier).
Hints:¶
Random Forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
OOB: https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay
#### START YOUR SOLUTION HERE ####
# fit model
clf = RandomForestClassifier(n_estimators=100, oob_score=True)
clf = clf.fit(trainX, trainy)
# compute predictions for the training and test sets
pred_train = clf.predict(trainX)
pred = clf.predict(testX)
# compute the accuracy scores (test, training and OOB)
acc_test = accuracy_score(testy, pred)
acc_train = accuracy_score(trainy, pred_train)
acc_oob = clf.oob_score_
# print the computed scores
print( "Performance measurements", "\n",
"training accuracy : ", round(acc_train,3),"\n",
"test accuracy : ", round(acc_test,3), "\n",
"out of bag accuracy : ", round(acc_oob,3),"\n"
)
# Compute the confusion matrix
conf = confusion_matrix(testy, pred)
# Plot the confusion matrix using a heatmap
cm_display = ConfusionMatrixDisplay(conf).plot()
#### END YOUR SOLUTION HERE ####
Performance measurements training accuracy : 1.0 test accuracy : 0.953 out of bag accuracy : 0.952
Tune the number of trees parameter using grid search¶
Use grid search CV (5 folds) to find the best number of trees (estimators) using a grid from 100 to 1000 with a step of 100. Print the best number of trees and its corresponding test accuracy score and cross-validation accuracy score.
# Define the grid for the number of trees
grid = {'n_estimators': np.arange(100, 1001, 100)}
# Do a grid search to find the optimal number of trees
rf = RandomForestClassifier(random_state = 42)
rf_cv = GridSearchCV(rf, grid, cv=5)
rf_cv.fit(trainX, trainy)
# print the best hyperparameter
print("Best Hyperparameter (number of trees):", rf_cv.best_params_)
# print the training CV accuracy score
print("Training CV Accuracy Score:", rf_cv.best_score_)
# print the test accuracy score
print("Test Accuracy Score:", rf_cv.score(testX, testy))
Best Hyperparameter (number of trees): {'n_estimators': 200}
Training CV Accuracy Score: 0.952246835443038
Test Accuracy Score: 0.9590643274853801
Importance plot¶
Use the permutation importance to compute the feature importances for the best model from the grid search CV.
Hints:¶
Forest importances: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
# retrieve the relative importance of each variable and visualize the importance plot
from sklearn.inspection import permutation_importance
#### START YOUR SOLUTION HERE ####
# get the best model from the grid search CV
best_rf_model = rf_cv.best_estimator_
# compute the feature importances using permutation test
perm_importances = permutation_importance(
best_rf_model, testX, testy, n_repeats=10, random_state=42, n_jobs=2)
# put them in a Series
forest_importances = pd.Series(perm_importances.importances_mean, index=cancer.feature_names)
# sort them (get the indices of the sorted array to be able to apply it on the errors)
sort_index = np.argsort(forest_importances)[::-1]
# plot the importances
fig, ax = plt.subplots()
forest_importances[sort_index].plot.bar(yerr=perm_importances.importances_std[sort_index], ax=ax)
ax.set_title("Feature importances using permutation on full model")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()
#### END YOUR SOLUTION HERE ####
Below we use the feature_importances_ attribute of the random forest model selected in the grid search, which quantifies feature importance based on mean decrease in impurity (MDI). These scores, however, can be misleading for continuous and high-cardinality features.
# get the feature importances from the fitted model
importances = best_rf_model.feature_importances_
# get the standard deviations
std = np.std([tree.feature_importances_ for tree in best_rf_model.estimators_], axis=0)
# put them in a Series
forest_importances = pd.Series(importances, index=cancer.feature_names)
# sort the importances (get the indices of the sorted array to be able to apply it on the errors)
sort_index = np.argsort(forest_importances)[::-1]
# plot them
fig, ax = plt.subplots()
forest_importances[sort_index].plot.bar(yerr=std[sort_index], ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
Decision Trees and Random Forest¶
Decision Trees¶
Decision trees are a supervised learning method able to handle non-linearly separable data. The tree is built from the training set: at each node, the algorithm chooses a splitting rule (based on a feature) that optimizes a given criterion (e.g. Gini index, entropy).
The objective of the algorithm is to find the simplest possible decision tree (i.e. only a few nodes = a small depth) with high predictive quality (e.g. high accuracy).
Background¶
Intuition ¶
Decision trees, as the name goes, use a tree-like model of decisions. At each node, the algorithm chooses a splitting rule (based on a feature) that maximizes the accuracy of the model. More precisely, at every split the algorithm maximizes a certain criterion previously given (e.g., Gini index, information gain).
The objective of the algorithm is to find the simplest possible decision tree (i.e., only a few nodes = a small depth) with the highest accuracy.
Consider the example below, where the objective is to classify if a person is fit or not. If we had chosen another criterion for the root node (e.g., "Exercises in the morning" instead of "Age<30"), we could have ended up with a lower accuracy and/or more splits (i.e., a more complex tree). The same logic applies at each decision node, until we reach the leaves, i.e., the final decision.
Decision Trees are simple to understand, interpret, and visualize. They can handle both numerical and categorical data, do not require feature scaling, and can deal with outliers. The algorithm is also good at handling non-linearly separable data.
As a drawback, Decision Trees suffer from a risk of overfitting, especially with large datasets, since the tree might become too complex. They can also be unstable, because small variations in the data might result in a completely different tree being generated. Potential solutions to avoid overfitting and get better performance include:
- Rely on cross-validation to find the proper depth.
- Build a collection of trees, i.e., a Random Forest; you can for instance read Understanding Random Forest for more explanation on the topic.
Finally, note that, as for KNN, Decision Trees can be used for both classification and regression. You can read Machine Learning Basics: Decision Tree Regression for a walk through on how to apply Decision Tree for regression tasks.
Decision criteria ¶
Growing a tree involves deciding which features to split on, what conditions to use for splitting, and when to stop. How do we do so? We typically use decision criteria, which evaluate the "purity" of a node. Here are some measures:
$$\text{Entropy}= - \sum_{i=1}^c p_i \log_2(p_i)$$

where $c$ is the number of classes and $p_i$ is the probability of randomly selecting an observation in class $i$. Let's consider two classes "0" and "1" for simplicity: $\text{Entropy} = - p_0 \log_2(p_0) - p_1 \log_2(p_1)$:
- When our dataset (or node) has 50% of observations belonging to class "0" and 50% belonging to class "1", then $p_0=p_1=1/2$ and $\text{Entropy} = 1$.
- When our dataset (or node) is "pure", say 0% of observations belonging to class "0" and 100% to class "1", then $p_0=0$, $p_1=1$, and $\text{Entropy} = 0$.
Source: Brona, Wikipedia Binary entropy plot
At each decision node, we compute its associated Entropy. Our objective is to obtain pure leaf nodes, and thus to reduce the entropy in the children nodes.
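To make the two limiting cases above concrete, here is a tiny sketch (not part of the exercise code) that evaluates the entropy of a maximally mixed and of a pure node:

```python
import numpy as np

def entropy(p):
    """Entropy of a node, given its class probabilities p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                  # by convention, 0 * log2(0) = 0
    return float(np.sum(-p * np.log2(p)) + 0.0)   # "+ 0.0" normalizes -0.0 to 0.0

print(entropy([0.5, 0.5]))  # 1.0 -> maximally mixed node
print(entropy([0.0, 1.0]))  # 0.0 -> pure node
```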
- Gini Index, also called Gini Impurity, is an alternative decision criterion, inspired by the Gini coefficient. It quantifies how often a randomly chosen example would be misclassified at a target node of the tree. The Gini of a dataset is:

$$\text{Gini} = 1 - \sum_{i=1}^c p_i^2$$
Let's again consider two classes "0" and "1" for simplicity: $\text{Gini}=1-p_0^2-p_1^2$:
- When our dataset (or node) has 50% of observations belonging to class "0" and 50% belonging to class "1", then $p_0=p_1=1/2$ and $\text{Gini} = 0.5$.
- When our dataset (or node) is "pure", say 0% of observations belonging to class "0" and 100% to class "1", then $p_0=0$, $p_1=1$, and $\text{Gini} = 0$.
At each decision node we compute the associated Gini Index, and then the average Gini Index of the split. Our objective is to minimize the Gini Index.
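The same kind of sketch (again, not part of the exercise code) works for the Gini index and reproduces the two limiting cases:

```python
import numpy as np

def gini(p):
    """Gini impurity of a node, given its class probabilities p."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

print(gini([0.5, 0.5]))  # 0.5 -> maximally mixed node
print(gini([0.0, 1.0]))  # 0.0 -> pure node
```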
For further information on the topic, you can read the articles:
Decision Trees Example¶
We now present a simple example demonstrating how to use the decision tree classifier from sklearn.
First we load the needed libraries and make some plot settings.
# Import needed libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix, accuracy_score
# Customize plots
%matplotlib inline
sns.set_theme(style="white")
# Change style (to make the tree plots look nice)
plt.style.use('classic')
In the code below we create three simple example datasets. The goal is to learn a model that predicts the salary class (0 or 1) based on some existing individual characteristics.
# set example equal to 1, 2 and 3 to look into the different examples
example = 1 # 1, 2 or 3
if example == 1:
    data = {"Degree":["Apprenticeship", "Apprenticeship", "Master", "Bachelor", "Master", "Apprenticeship", "Bachelor", "Bachelor", "Master", "Master"],
            "Sex":[1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
            "Salary Class":[0, 0, 1, 0, 1, 0, 0, 1, 1, 1]}
    data = pd.DataFrame(data)
elif example == 2:
    data = {"Age":[20, 16, 50, 23, 36, 33, 41, 22, 27, 57],
            "Degree":["Apprenticeship", "School", "Master", "Bachelor", "Bachelor", "Apprenticeship", "Bachelor", "Bachelor", "Master", "Master"],
            "Sex":[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
            "Salary Class":[0, 0, 0, 0, 1, 1, 0, 0, 1, 1]}
    data = pd.DataFrame(data)
elif example == 3:
    data = {"Degree":["Apprenticeship", "School", "Master", "Bachelor", "Bachelor", "Apprenticeship", "Bachelor", "Bachelor", "Master", "Master"],
            "Sex":[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
            "Salary Class":[1, 1, 0, 1, 1, 0, 0, 0, 1, 1]}
    data = pd.DataFrame(data)
else:
    data = np.nan
    raise ValueError("'example' should be 1, 2 or 3")
data
| Degree | Sex | Salary Class | |
|---|---|---|---|
| 0 | Apprenticeship | 1 | 0 |
| 1 | Apprenticeship | 1 | 0 |
| 2 | Master | 0 | 1 |
| 3 | Bachelor | 1 | 0 |
| 4 | Master | 0 | 1 |
| 5 | Apprenticeship | 1 | 0 |
| 6 | Bachelor | 1 | 0 |
| 7 | Bachelor | 0 | 1 |
| 8 | Master | 1 | 1 |
| 9 | Master | 1 | 1 |
We have three example datasets.
Example 1:
- All "Master" belong to class 1.
- Among the rest, if sex == 0, then class 1.
- A human could do the classification (easy).
Example 2:
- More difficult to see something...
- Hint: look at young people...
Example 3:
- Illustrates that it is sometimes difficult to classify.
- This is because of lack of pattern in the data.
- If there is nothing to discover, then the algorithm will discover nothing...
- The tree for this model illustrates this well.
Use the OneHotEncoder to encode the categorical feature (Degree) into 0-1 encoding and add them back to the dataframe.
# Encode the categorical feature into 0-1 encoding using OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
one_hot = OneHotEncoder()
one_hot_degree = one_hot.fit_transform(data[["Degree"]]).toarray()
one_hot_degree = pd.DataFrame(one_hot_degree, columns=one_hot.get_feature_names_out())
one_hot_degree
| Degree_Apprenticeship | Degree_Bachelor | Degree_Master | |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 |
| 6 | 0.0 | 1.0 | 0.0 |
| 7 | 0.0 | 1.0 | 0.0 |
| 8 | 0.0 | 0.0 | 1.0 |
| 9 | 0.0 | 0.0 | 1.0 |
# Add (concatenate) your one-hot encoded features back in the dataframe
data_tree = pd.concat([data, one_hot_degree], axis=1)
data_tree
| Degree | Sex | Salary Class | Degree_Apprenticeship | Degree_Bachelor | Degree_Master | |
|---|---|---|---|---|---|---|
| 0 | Apprenticeship | 1 | 0 | 1.0 | 0.0 | 0.0 |
| 1 | Apprenticeship | 1 | 0 | 1.0 | 0.0 | 0.0 |
| 2 | Master | 0 | 1 | 0.0 | 0.0 | 1.0 |
| 3 | Bachelor | 1 | 0 | 0.0 | 1.0 | 0.0 |
| 4 | Master | 0 | 1 | 0.0 | 0.0 | 1.0 |
| 5 | Apprenticeship | 1 | 0 | 1.0 | 0.0 | 0.0 |
| 6 | Bachelor | 1 | 0 | 0.0 | 1.0 | 0.0 |
| 7 | Bachelor | 0 | 1 | 0.0 | 1.0 | 0.0 |
| 8 | Master | 1 | 1 | 0.0 | 0.0 | 1.0 |
| 9 | Master | 1 | 1 | 0.0 | 0.0 | 1.0 |
# Select X and y
X = data_tree.drop(["Degree", "Salary Class"], axis=1)
y = data_tree["Salary Class"]
# Classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
tree = DecisionTreeClassifier()
tree.fit(X, y)
tree.score(X, y)
1.0
Let's plot our decision tree.
It starts with the root in which we have 10 samples (our data points) of which 5 belong to class 0 and 5 belong to class 1.
Each node represents a condition on which the tree splits into branches. The end of a branch that no longer splits is a leaf, in this case salary class 0 (orange) or 1 (blue).
The Gini index is our measure of the purity of each node. Our dataset starts at 0.5 (corresponding to the 50-50 distribution of classes in the root) and then gradually goes down to 0 (maximum purity).
plt.figure(figsize=(5,5))
plot_tree(tree, filled=True);
# To see the splits
pd.concat([X, y], axis=1).head(2)
| Sex | Degree_Apprenticeship | Degree_Bachelor | Degree_Master | Salary Class | |
|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 0.0 | 0.0 | 0 |
| 1 | 1 | 1.0 | 0.0 | 0.0 | 0 |
What is the depth of this tree?
tree.get_depth()
2
Have a look at the parameters of the DecisionTreeClassifier sklearn documentation for this classifier.
One can, for example, limit the maximum depth of the tree with max_depth or set a minimum number of observations in each leaf with min_samples_leaf.
With sklearn, the default criterion for determining the split at each node is the Gini index. Based on this criterion, the algorithm chooses which feature, and which condition on this feature, to use to make the best split. For more info you can read this Medium post.
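To make the criterion concrete, here is a hedged sketch of how a candidate split can be scored by the weighted average Gini of its two children (the actual sklearn search over features and thresholds is more elaborate):

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(left, right):
    # Average Gini of a split, weighted by the number of samples per child
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# The algorithm prefers the split with the lower value
print(split_gini([0, 0, 0], [1, 1, 1]))  # perfect split -> 0.0
print(split_gini([0, 1, 0], [1, 0, 1]))  # mixed split   -> ~0.444
```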
Drug Classification Example¶
We classify people into drug categories according to a set of individual characteristics (blood pressure, age, cholesterol, ...).
Load Data¶
# Load dataset
df = pd.read_csv("drug200.csv")
df.head()
| Age | Sex | BP | Cholesterol | Na_to_K | Drug | |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | DrugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | DrugY |
The variables:
- Age: Age of patient
- Sex: Gender of patient
- BP: Blood pressure of patient
- Cholesterol: Cholesterol of patient
- Na_to_K: Sodium to Potassium Ratio in Blood
- Drug: Drug Type
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 200 entries, 0 to 199 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 200 non-null int64 1 Sex 200 non-null object 2 BP 200 non-null object 3 Cholesterol 200 non-null object 4 Na_to_K 200 non-null float64 5 Drug 200 non-null object dtypes: float64(1), int64(1), object(4) memory usage: 9.5+ KB
Basic Data Analysis¶
df.describe()
| Age | Na_to_K | |
|---|---|---|
| count | 200.000000 | 200.000000 |
| mean | 44.315000 | 16.084485 |
| std | 16.544315 | 7.223956 |
| min | 15.000000 | 6.269000 |
| 25% | 31.000000 | 10.445500 |
| 50% | 45.000000 | 13.936500 |
| 75% | 58.000000 | 19.380000 |
| max | 74.000000 | 38.247000 |
# Analysis of the data (univariate)
fig, ax = plt.subplots(3, 2, figsize=(20,15))
i = 0
j = 0
for var in df:
if df[var].dtypes == "object":
sns.countplot(x=df[var], ax=ax[i, j])
else:
sns.histplot(df[var], ax=ax[i, j])
i += 1
if i == 3:
i = 0
j += 1
plt.show()
Show the relative frequency of the different drug classes.
df.Drug.value_counts(normalize=True)
Drug DrugY 0.455 drugX 0.270 drugA 0.115 drugC 0.080 drugB 0.080 Name: proportion, dtype: float64
Look into any dependencies between the target Drug and the numeric input features: Age-Drug and Na_to_K-Drug.
plt.figure(figsize=(10,10))
sns.pairplot(df, hue="Drug")
Look into any dependencies between the target Drug and the categorical input features: Cholesterol-Drug and BP-Drug.
# Cholesterol-Drug
# Counts of the different combinations
df_CH_Drug = df.groupby(["Drug","Cholesterol"]).size().reset_index(name = "Count")
# Barplot of the counts
plt.figure(figsize = (9,5))
sns.barplot(x = "Drug",y="Count", hue = "Cholesterol",data = df_CH_Drug)
plt.title("Cholesterol -- Drug")
plt.show()
# Blood Pressure (BP) - Drug
# Counts of the different combinations
df_BP_Drug = df.groupby(["Drug","BP"]).size().reset_index(name = "Count")
# Barplot of the counts
plt.figure(figsize = (9,5))
sns.barplot(x = "Drug",y="Count", hue = "BP",data = df_BP_Drug)
plt.title("BP -- Drug")
plt.show()
We see that all people with a Na_to_K ratio above around 15 take DrugY. This will be useful for classification.
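This rule is easy to verify in pandas. The snippet below uses a tiny made-up frame (not the real dataset) just to show the check; with the real df it would be df.loc[df["Na_to_K"] > 15, "Drug"].value_counts():

```python
import pandas as pd

# Toy frame standing in for the drug dataset (illustrative values only)
toy = pd.DataFrame({"Na_to_K": [25.3, 13.1, 18.0, 9.9],
                    "Drug": ["DrugY", "drugC", "DrugY", "drugX"]})

# Drug distribution among patients with a high Na_to_K ratio
high = toy.loc[toy["Na_to_K"] > 15, "Drug"].value_counts()
print(high)  # only DrugY remains
```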
Prepare data for the algorithm¶
df
| Age | Sex | BP | Cholesterol | Na_to_K | Drug | |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | DrugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | DrugY |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 56 | F | LOW | HIGH | 11.567 | drugC |
| 196 | 16 | M | LOW | HIGH | 12.006 | drugC |
| 197 | 52 | M | NORMAL | HIGH | 9.894 | drugX |
| 198 | 23 | M | NORMAL | NORMAL | 14.020 | drugX |
| 199 | 40 | F | LOW | NORMAL | 11.349 | drugX |
200 rows × 6 columns
Label encoding using OrdinalEncoder combined with ColumnTransformer (both from sklearn).
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
# instantiate encoder
oe=OrdinalEncoder()
# select variables for label encoding
categorical_cols=['Sex', 'BP', 'Cholesterol', 'Drug']
# set up your preprocessor (name, transformer, columns to transform)
# remainder=passthrough means we keep the remaining features
preprocessor = ColumnTransformer([('categorical', oe, categorical_cols)], remainder='passthrough')
# fit the pre-processor and transform the data in one step
encoded_df = pd.DataFrame(preprocessor.fit_transform(df), columns=['Sex', 'BP', 'Cholesterol', 'Drug', 'Age', 'Na_to_K'])
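As a small illustration of what OrdinalEncoder does on its own (note: by default it orders the categories alphabetically, so the integer codes carry no real ordering of the levels):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"BP": ["HIGH", "LOW", "NORMAL", "LOW"]})
enc = OrdinalEncoder()
codes = enc.fit_transform(toy)
# Alphabetical ordering: HIGH -> 0, LOW -> 1, NORMAL -> 2
print(enc.categories_)
print(codes.ravel())  # [0. 1. 2. 1.]
```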
First, create the inputs and targets, then create the training and test splits.
# Create inputs X and target y
X = encoded_df.drop(["Drug"], axis=1)
y = encoded_df.Drug
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=72)
X_train
| Sex | BP | Cholesterol | Age | Na_to_K | |
|---|---|---|---|---|---|
| 90 | 1.0 | 2.0 | 0.0 | 62.0 | 16.594 |
| 163 | 0.0 | 0.0 | 1.0 | 21.0 | 28.632 |
| 76 | 0.0 | 0.0 | 0.0 | 36.0 | 11.198 |
| 113 | 0.0 | 1.0 | 1.0 | 65.0 | 13.769 |
| 98 | 1.0 | 0.0 | 1.0 | 20.0 | 35.639 |
| ... | ... | ... | ... | ... | ... |
| 69 | 0.0 | 0.0 | 1.0 | 18.0 | 24.276 |
| 101 | 0.0 | 0.0 | 0.0 | 45.0 | 12.854 |
| 74 | 1.0 | 0.0 | 1.0 | 31.0 | 17.069 |
| 46 | 0.0 | 0.0 | 0.0 | 37.0 | 13.091 |
| 19 | 0.0 | 0.0 | 1.0 | 32.0 | 25.974 |
130 rows × 5 columns
Calculate the base rate using DummyClassifier with the most_frequent strategy, which assigns the most frequent class as the prediction for all samples.
from sklearn.dummy import DummyClassifier
# instantiate with the "most frequent" parameter
dummy = DummyClassifier(strategy='most_frequent')
# fit it (the dummy classifier ignores the features)
dummy.fit(X_train, y_train)
# compute the test baseline and store it for later
baseline = dummy.score(X_test, y_test)
baseline
0.42857142857142855
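The same baseline can be obtained directly from the label frequencies, since the most_frequent strategy simply predicts the majority class; a sketch on toy labels:

```python
import pandas as pd

y_toy = pd.Series([0, 0, 0, 1, 1, 2])
# Accuracy of always predicting the majority class
baseline_toy = y_toy.value_counts(normalize=True).max()
print(baseline_toy)  # 0.5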
Decision Tree¶
Fit a decision tree model on the training set, use it to compute the predictions on the test set and finally evaluate the model by computing the accuracy score and plotting the confusion matrix.
# Fit model
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
# Predict
y_pred = tree.predict(X_test)
# Evaluate model
def accuracy_conf_mat(y_test, y_pred):
print("Accuracy score:", round(accuracy_score(y_test, y_pred), 4))
conf_mat = confusion_matrix(y_test, y_pred)
cm_display = ConfusionMatrixDisplay(conf_mat).plot()
accuracy_conf_mat(y_test, y_pred)
Accuracy score: 0.9714
plt.figure(figsize=(12, 12))
plot_tree(tree, filled=True);
tree.get_depth()
4
# Accuracy in training set
tree.score(X_train, y_train)
1.0
In this simple example we are not overfitting, since we also obtain a high accuracy on the test data. With real-life data, this will rarely be the case... Let's try different tree depths and look at the fitted trees.
# Try out tree depths from 2 to 5
for depth in [2, 3, 4, 5]:
tree = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
y_pred = tree.predict(X_test)
print("Depth: " + str(depth))
print(round(accuracy_score(y_test, y_pred), 2))
plt.figure(figsize=(6, 6))
plot_tree(tree, filled=True)
plt.show()
print("\n\n\n\n")
Depth: 2 0.83
Depth: 3 0.81
Depth: 4 0.97
Depth: 5 0.97
And this is how we can set up grid search to tune the optimal value of the maximum tree depth.
# Grid Search - tuning tree depth
from sklearn.model_selection import GridSearchCV
# Define parameter to test - max_depth from 1 to 6
grid = {'max_depth':np.arange(1,7)}
# Define and fit model
tree = DecisionTreeClassifier()
# Grid search CV with 5-fold cross validation
tree_cv = GridSearchCV(tree, grid, cv=5)
tree_cv.fit(X_train, y_train)
# Print results
print("Hyperparameters (best max_depth):", tree_cv.best_params_)
print("Training CV Accuracy Score:", round(tree_cv.best_score_, 4))
print("Test Accuracy Score:", round(tree_cv.score(X_test, y_test), 4))
Hyperparameters (best max_depth): {'max_depth': 4}
Training CV Accuracy Score: 0.9923
Test Accuracy Score: 0.9714
Random Forest¶
Now we will train a random forest classifier on the same dataset and evaluate its performance.
from sklearn.ensemble import RandomForestClassifier
# Fit a random forest classifier
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
# Compute the predictions on the test set
y_pred = rfc.predict(X_test)
# Evaluate Model
accuracy_conf_mat(y_test, y_pred)
Accuracy score: 1.0
We will use grid search to tune the hyperparameters n_estimators and criterion (the splitting criterion of the individual decision trees) of the random forest classifier.
grid = {'n_estimators': np.arange(100,1000,100),
'criterion': ['gini','entropy']
}
rf = RandomForestClassifier(random_state=42)
rf_cv = GridSearchCV(rf, grid, cv=5)
rf_cv.fit(X_train,y_train)
print("Hyperparameters:", rf_cv.best_params_)
print("Training CV Accuracy Score:", rf_cv.best_score_)
print("Test Accuracy Score:", rf_cv.score(X_test,y_test))
Hyperparameters: {'criterion': 'gini', 'n_estimators': 100}
Training CV Accuracy Score: 0.9923076923076923
Test Accuracy Score: 1.0
# Import standard libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Import to load arff file from url
from scipy.io import arff
import urllib.request
import io
# Sklearn import
from sklearn.model_selection import train_test_split # Splitting the data set
from sklearn.preprocessing import MinMaxScaler, StandardScaler # Normalization and standard scaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder # Label and 1-hot encoding
from sklearn.linear_model import LogisticRegression # Logistic regression model
from sklearn.linear_model import LogisticRegressionCV # Logistic regression with cross-validation
from sklearn.metrics import accuracy_score # Accuracy
from sklearn.metrics import confusion_matrix # Confusion matrix
from sklearn.metrics import precision_score, recall_score, f1_score # Precision, recall, and f1 score
Logistic regression¶
Suppose we have n observations of an outcome $\boldsymbol{y}$ and d associated features $\boldsymbol{x_1}$, $\boldsymbol{x_2}$, ... , $\boldsymbol{x_d}$ (note that $\boldsymbol{y}$, $\boldsymbol{x_1}$, ..., $\boldsymbol{x_d}$ are vectors):
| Outcome | Feature 1 | Feature 2 | ... | Feature d | |
|---|---|---|---|---|---|
| Observation 1 | $y_1$ | $x_{11}$ | $x_{12}$ | ... | $x_{1d}$ |
| Observation 2 | $y_2$ | $x_{21}$ | $x_{22}$ | ... | $x_{2d}$ |
| ... | ... | ... | ... | ... | ... |
| Observation n | $y_n$ | $x_{n1}$ | $x_{n2}$ | ... | $x_{nd}$ |
We will focus on binary classification for now. In other words, our outcome can take two values, 0 and 1, which represent two classes (e.g., cat or dog, spam email or not, risky or safe loan, etc.).
Remember when we did multivariate linear regression, we assumed that our model function $f_{\text{mv}}$, i.e., our prediction, was a linear combination of our features. For each observation $i$, we assumed: $$f_{\text{mv}}(\boldsymbol{X_{i*}}, \boldsymbol{w}):=w_0 + w_1 x_{i,1} + w_2 x_{i,2} + ... + w_d x_{i,d}$$ with $\boldsymbol{w}=(w_0, w_1, ..., w_d)$ the vector of weights, and $\boldsymbol{X}=[\boldsymbol{x_1}$, ... , $\boldsymbol{x_d}]$ the matrix of feature variables.
For each observation, our true outcome was $y_i = f_{\text{mv}}(\boldsymbol{X_{i*}}, \boldsymbol{w}) + \epsilon_i$, and our goal was to minimize the errors.
In this setting, our model function $f_{\text{mv}}$ can take any values. It is thus suited when our outcome is continuous. However, with binary classification, we are dealing with discrete values, and more precisely with 0 and 1. How can we modify our model to obtain proper predictions?
The idea of logistic regression is to transform the predictions obtained with a linear regression such that the predictions are between 0 and 1. To do so, we rely on the Sigmoid (logistic) function:
$$S(x) = \frac{1}{1 + e^{-x}}$$
Source: Qef, Logistic curve plot, from Wikipedia
With logistic regression, we apply the sigmoid function to the output of the multivariate regression model. Let $f_{\text{logi}}$ be the prediction function of a logistic regression, we have:
$$f_{\text{logi}}(\boldsymbol{X_{i*}}, \boldsymbol{w}):= \frac{1}{1 + e^{-(w_0 + w_1 x_{i,1} + w_2 x_{i,2} + ... + w_d x_{i,d})}}$$$f_{\text{logi}}$ represents the probability that a given observation belongs to class 1, i.e., $y_i=1$:
- We predict that the observation belongs to class 1 when $f_{\text{logi}}(\boldsymbol{X_{i*}}, \boldsymbol{w}) \geq 0.5$, i.e., when $w_0 + w_1 x_{i,1} + w_2 x_{i,2} + ... + w_d x_{i,d} \geq 0$;
- Reciprocally, we predict that the observation belongs to class 0 when $f_{\text{logi}}(\boldsymbol{X_{i*}}, \boldsymbol{w})<0.5$, i.e., $w_0 + w_1 x_{i,1} + w_2 x_{i,2} + ... + w_d x_{i,d}<0$.
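As a quick numerical check of the sigmoid and the 0.5 decision rule (plain NumPy, independent of sklearn):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Linear scores w0 + w1*x_1 + ... + wd*x_d for three observations
z = np.array([-2.0, 0.0, 3.0])
probs = sigmoid(z)
preds = (probs >= 0.5).astype(int)  # class 1 exactly when the score is >= 0
print(np.round(probs, 3))  # [0.119 0.5   0.953]
print(preds)               # [0 1 1]
```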
Now our problem is the same as before: we want to minimize the errors of our model, learning the weights $w_0$, $w_1$, ..., $w_d$ from our data. To do so, we are minimizing our loss function... but which one? We will explore one option below.
Logistic Loss function¶
For linear regression, we used the Least Squared Error as loss function:
$$ \min_\boldsymbol{w} \sum_{i=1}^n (y_i - f_{\text{mv}}(\boldsymbol{X_{i*}}, \boldsymbol{w}))^2 $$Can we use the same for logistic regression? No! Indeed, using the Least Squared Error with our new prediction function $f_{\text{logi}}$ would result in a non-convex loss, which is problematic for our minimization problem since we could get stuck in local minima:
Source: Issam Laradji, Non-convex Optimization
So which loss function can we use? Ideally, we want to assign more punishment when predicting 1 while the actual value is 0 and when predicting 0 while the actual value is 1. One such function is the... Logistic Loss:
$$L(\boldsymbol{y}, \boldsymbol{X}, \boldsymbol{w})= -\frac{1}{n} \sum_{i=1}^n [y_i \log(f_{\text{logi}}(\boldsymbol{X_{i*}}, \boldsymbol{w})) + (1-y_i) \log(1-f_{\text{logi}}(\boldsymbol{X_{i*}}, \boldsymbol{w}))] $$Let's decompose our function to understand a bit more how it works. For each observation $i$, the cost is:
$$\text{Cost}_i = - y_i \log(f_{\text{logi}}(\boldsymbol{X_{i*}}, \boldsymbol{w})) - (1-y_i) \log(1-f_{\text{logi}}(\boldsymbol{X_{i*}}, \boldsymbol{w}))$$- When $y_i = 1$, $\text{Cost}_i = - \log(f_{\text{logi}}(\boldsymbol{X_{i*}}, \boldsymbol{w}))$. Hence, if our predicted probability is 1, we have $\text{Cost}_i=0$, i.e., no cost. However, when our predicted probability is approaching 0, our cost goes to infinity (because the logarithm goes to minus infinity when we get closer to zero).
- When $y_i = 0$, $\text{Cost}_i = - \log(1-f_{\text{logi}}(\boldsymbol{X_{i*}}, \boldsymbol{w}))$, and it works the other way around. If our predicted probability is zero, the cost is zero, but when our predicted probability is approaching 1, our cost goes to infinity.
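This behavior is easy to check numerically with a minimal implementation of the loss (a sketch; sklearn's metrics.log_loss does the same with extra safeguards):

```python
import numpy as np

def logistic_loss(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0])
print(logistic_loss(y, np.array([0.99, 0.01])))  # confident and right -> near 0
print(logistic_loss(y, np.array([0.01, 0.99])))  # confident and wrong -> large
```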
Source: Shuyu Luo, Loss Function (Part II): Logistic Regression
The Logistic Loss not only punishes errors with a very large cost, it is also convex. Hence, we can apply Gradient Descent to obtain the model parameters that minimize the loss!
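To illustrate, here is a bare-bones Gradient Descent on the logistic loss for a toy one-feature problem (an illustrative sketch; this is not how sklearn actually fits the model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: one feature, class 1 exactly when x > 0
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(float)
Xb = np.hstack([np.ones((200, 1)), X])   # prepend a column of ones for w0

w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-Xb @ w))        # predicted probabilities
    w -= 0.1 * Xb.T @ (p - y) / len(y)   # gradient step on the logistic loss

pred = (1 / (1 + np.exp(-Xb @ w)) >= 0.5).astype(float)
print("training accuracy:", (pred == y).mean())
```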
To learn more:
- Loss Function (Part II): Logistic Regression, by Shuyu Luo, Published in Towards Data Science
- Understanding the log loss function, by Susmith Reddy, Published in Analytics Vidhya
Implementation¶
For the walkthrough we will use a dataset on wine quality.
The wine data set consists of 11 different parameters of wine such as alcohol content, acidity, and pH, which were measured for several wine samples from the North of Portugal.
Source: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. Dataset obtained from UCI Machine Learning repository, Wine Quality Data Set.
These wines were derived from different cultivars and therefore differ in quality, expressed as a score between 0 and 10. We grouped the wines into two quality classes: 0 and 1, representing respectively "poor quality" (score 0-5) and "good quality" (score 6-10).
Our goal here is to find a model that can predict the class of wine given the 11 measured parameters, and find out the major differences among the two classes.
Load and discover dataset¶
#Load the dataset
wines = pd.read_csv("wine-quality-red.csv")
# Display a sample of the data
display(wines.head())
# Print columns
print(wines.columns.values)
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 0 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 1 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 0 |
['fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar' 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH' 'sulphates' 'alcohol' 'quality']
Note that we only have numerical variables, and thus won't need to encode any categorical variables.
However, we will need to rescale our features, since they are on different scales: for instance, chlorides values are lower than 1 while total sulfur dioxide can reach a value of 289.
wines.describe()
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.875547 | 46.468418 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 0.534709 |
| std | 1.741096 | 0.179060 | 0.194801 | 1.409928 | 0.047065 | 10.460434 | 32.895920 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.498950 |
| min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.330000 | 8.400000 | 0.000000 |
| 25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995600 | 3.210000 | 0.550000 | 9.500000 | 0.000000 |
| 50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 0.996750 | 3.310000 | 0.620000 | 10.200000 | 1.000000 |
| 75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 62.000000 | 0.997835 | 3.400000 | 0.730000 | 11.100000 | 1.000000 |
| max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 1.000000 |
We now define our features - all wine parameters - and our target variable - the wine quality:
# Define features and target variable
X = wines.drop(columns='quality')
y = wines['quality']
We now check how many observations we have for each class:
# Count the number of observations (rows) corresponding to each value
y.value_counts()
quality 1 855 0 744 Name: count, dtype: int64
We have 855 "good" quality wines and 744 "poor" quality wines. The number of observations in each class influences the quality of our predictions. Here, our dataset is reasonably balanced.
Splitting the dataset¶
As always, the first step is to split our data into random training and test subsets. Recall that the training set is used to learn the parameters of our model while the test set is used to evaluate our predictions.
We use the train_test_split (Documentation) of sklearn and reserve 25% of the original data as test set.
#Split data set into a train and a test data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, shuffle=True)
print(f"The training set has {X_train.shape[0]} observations, and the test set has {X_test.shape[0]} observations.")
The training set has 1199 observations, and the test set has 400 observations.
Rescaling¶
When the features of a dataset have very different ranges, many ML models (including logistic regression) can produce biased results. We therefore want the features on the same or a similar scale, which also helps the interpretation of the model parameters (weights). In our example solution below we will normalize both our train AND test data using the MinMaxScaler() (Documentation).
# Define the scaler
scaler = MinMaxScaler()
# Fit the scaler
scaler.fit(X_train) # here the scaler learns the min and max of each attribute from the training set
# Transform the train and the test set
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
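Under the hood, MinMaxScaler applies $x' = (x - x_{\min})/(x_{\max} - x_{\min})$ per feature, with the min and max learned from the training set only; a quick NumPy check of the formula:

```python
import numpy as np

x_train = np.array([2.0, 4.0, 10.0])
x_min, x_max = x_train.min(), x_train.max()  # learned on the training set
scaled = (x_train - x_min) / (x_max - x_min)
print(scaled)  # [0.   0.25 1.  ]
# A test value outside the training range can leave [0, 1]
print((12.0 - x_min) / (x_max - x_min))  # 1.25
```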
Building and training our classifier¶
To predict the class of our target variable we use a logistic regression. The sklearn module is called LogisticRegression() (Documentation).
Note that L2-regularization is applied by default. By specifying the argument penalty, you can specify the regularization techniques, namely 'l1', 'l2', 'elasticnet', or None.
You can also specify the solver. By default, 'lbfgs' is used, which stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno. Note that the choice of the algorithm depends on the penalty chosen. You can refer to the documentation for insights on the choice of solver/penalty depending on your problem and data.
A short note on solvers: L-BFGS approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS), which is based on Newton's method, an alternative to Gradient Descent. While Gradient Descent relies only on the gradient (first-order derivatives) to update the parameters, Newton's method also makes use of the Hessian matrix, i.e., the second-order derivatives. Newton's method generally converges faster than Gradient Descent; however, it is computationally expensive and the Hessian might not even exist. Hence, numerical methods called Quasi-Newton methods, such as BFGS, were developed to solve such optimization problems.
# 1. Set up our model
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
# 2. Fit our model
model.fit(X_train, y_train)
LogisticRegression(max_iter=1000)
After fitting the model, we can easily retrieve the fitted parameters: the intercept with .intercept_, and the weight of each feature with .coef_.flatten():
# Dataframe with the intercept and coefficients (weights) of the logistic model
model_coeff = pd.DataFrame(np.concatenate((model.intercept_, model.coef_.flatten())),
index=["Intercept"] + list(X.columns.values),
columns=['Coefficients logistic model'])
model_coeff
| Coefficients logistic model | |
|---|---|
| Intercept | -0.096618 |
| fixed acidity | 1.060667 |
| volatile acidity | -3.335574 |
| citric acid | -0.269326 |
| residual sugar | 0.596732 |
| chlorides | -1.266971 |
| free sulfur dioxide | 0.391015 |
| total sulfur dioxide | -2.752863 |
| density | -0.782913 |
| pH | -0.245513 |
| sulphates | 2.852787 |
| alcohol | 4.432689 |
It seems like the level of alcohol and volatile acidity were the most important features to predict the wine quality, at least in our model.
Using the classifier to make prediction¶
Once our model has been trained, we can use predict() to predict the class of new observations. Here we predict the classes of the test set, which we then use to evaluate the model, estimating the accuracy of our classifier.
y_pred = model.predict(X_test)
We can even access the probabilities that one observation belongs to one class or the other with predict_proba(). The largest probability determines the predicted class.
# Dataframe with probabilities that our first 4 observations belong to each class
model_proba = pd.DataFrame(model.predict_proba(X_test)[0:4],
columns=['Probability poor-quality wine', 'Probability good wine'])
model_proba
| Probability poor-quality wine | Probability good wine | |
|---|---|---|
| 0 | 0.352811 | 0.647189 |
| 1 | 0.727663 | 0.272337 |
| 2 | 0.089182 | 0.910818 |
| 3 | 0.586545 | 0.413455 |
Evaluating our classifier¶
We will now evaluate the performance of our classifier using several metrics.
Accuracy¶
For a sklearn classifier, this can be computed using the score method.
# Accuracy on the test set
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
.format(model.score(X_test, y_test)))
# Accuracy on the training set
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
.format(model.score(X_train, y_train)))
Accuracy of Logistic regression classifier on test set: 0.75 Accuracy of Logistic regression classifier on training set: 0.74
Alternatively, we could use the accuracy_score module:
accuracy_test = accuracy_score(y_test, y_pred)
print(f'Accuracy of Logistic regression classifier on test set: {accuracy_test:.2f}')
Accuracy of Logistic regression classifier on test set: 0.75
When the testing accuracy is much lower than the training accuracy, we have an overfitting issue. Conversely, when the testing accuracy is similar to or higher than the training accuracy, the model might be underfitting, and we could consider using a more powerful model or adding additional features.
Our testing accuracy is 75%. Is that good? It depends! The quality of our prediction depends on the distribution of the classes in our original data:
y.value_counts().plot.bar(color=['purple', 'blue'], grid=False)
plt.ylabel('Number of observations')
plt.title('Number of observations of each class in the wine dataset');
Imagine a naive classifier that always predicts the majority class. We call the accuracy of this classifier the default rate (or base rate): the size of the most frequent class divided by the total number of observations:
$$\text{Default rate} = \frac{\# \text{ most frequent class}}{\# \text{ total observations}}$$If the default rate is very high, the dataset is imbalanced: one class has far more observations than the others and therefore dominates the classification results.
The accuracy of our classifier should be better than the default rate. Let's calculate this default rate!
# Compute the default rate
quality_0 = wines.loc[wines["quality"] == 0].shape[0]
print('# occurrence of class 0: ', quality_0)
quality_1 = wines.loc[wines["quality"] == 1].shape[0]
print('# occurrence of class 1: ', quality_1)
defaultrate = max(quality_0, quality_1)/(wines["quality"].shape[0])
print(f'Default rate = {defaultrate:0.4f}')
# occurrence of class 0: 744 # occurrence of class 1: 855 Default rate = 0.5347
The default rate for our task is about 53.5% while our classifier accuracy is 75%. Not too bad!
Confusion matrix¶
The confusion matrix allows us to get more details on the performance of our model. It will allow us to see what our classification model is getting right and what types of errors it is making. So let's compute it. It requires as input the true values and the predicted values:
confusion_matrix(y_test, y_pred)
array([[137, 48],
[ 52, 163]])
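The accuracy reported above can be recovered directly from this matrix: correct predictions sit on the diagonal, so accuracy is the trace divided by the total count:

```python
import numpy as np

cm = np.array([[137, 48],
               [52, 163]])
accuracy = np.trace(cm) / cm.sum()  # (137 + 163) / 400
print(accuracy)  # 0.75
```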
To obtain a more visual representation, we will use heatmap from the seaborn library:
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='Blues', fmt='.4g')
plt.xlabel('Predicted label')
plt.ylabel('True labels')
plt.title('Confusion Matrix');
Precision, Recall, F Score¶
We will compute the precision using precision_score (Documentation), the recall using recall_score (Documentation), and the F1 score using f1_score (Documentation).
For a binary classifier, all metrics will report by default the scores associated with the positive class (i.e., with observations equal to 1). If we are interested in the results for another class, we need to specify this in the parameters. For instance, the parameter average = None will return the scores of each class:
print('The precision for class 1 (good wines) is: {:0.3f}'.format(precision_score(y_test, y_pred)))
print('The recall for class 1 is: {:0.3f}'.format(recall_score(y_test, y_pred)))
print('The F1 score for class 1 is: {:0.3f}'.format(f1_score(y_test, y_pred)))
The precision for class 1 (good wines) is: 0.773
The recall for class 1 is: 0.758
The F1 score for class 1 is: 0.765
# Precision of each class
model_precision = precision_score(y_test, y_pred, average = None)
# Recall of each class
model_recall = recall_score(y_test, y_pred, average = None)
# F1 score of each class
model_f1 = f1_score(y_test, y_pred, average = None)
# Visualize all results in a dataframe:
model_eval = pd.DataFrame([model_precision, model_recall, model_f1],
index = ['Precision', 'Recall', 'F1 score'],
columns=['Class 0', 'Class 1'])
model_eval
| Class 0 | Class 1 | |
|---|---|---|
| Precision | 0.724868 | 0.772512 |
| Recall | 0.740541 | 0.758140 |
| F1 score | 0.732620 | 0.765258 |
You can find all the sklearn model evaluation metrics here.
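As an aside, all of the per-class scores above can also be produced in one call with `classification_report`. A minimal sketch with toy labels standing in for `y_test` and `y_pred`:

```python
from sklearn.metrics import classification_report

# Toy labels; in the notebook you would pass y_test and y_pred instead
y_true_demo = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0, 1, 0]

# Prints precision, recall, F1 and support for each class at once
print(classification_report(y_true_demo, y_pred_demo, digits=3))
```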
Your turn!¶
Now it's your turn to use logistic regression! In this application, you will try to predict whether a forest fire spread and burned forest areas in the Montesinho natural park in Portugal.
We are using the Forest Fires dataset, created by Paulo Cortez and Aníbal Morais, and available on Kaggle.
Source: P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.
The original dataset contains 13 columns:
- X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
- Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
- month - month of the year: "jan" to "dec"
- day - day of the week: "mon" to "sun"
- FFMC - Fine Fuel Moisture Code (FFMC) index from the Fire Weather Index (FWI) system: 18.7 to 96.20
- DMC - Duff Moisture Code (DMC) index from the FWI system: 1.1 to 291.3
- DC - Drought Code (DC) index from the FWI system: 7.9 to 860.6
- ISI - Initial Spread Index (ISI) index from the FWI system: 0.0 to 56.10
- temp - temperature in Celsius degrees: 2.2 to 33.30
- RH - relative humidity in %: 15.0 to 100
- wind - wind speed in km/h: 0.40 to 9.40
- rain - outside rain in mm/m2 : 0.0 to 6.4
- area - the burned area of the forest (in ha): 0.00 to 1090.84
In addition, we created a new column, "class", detailing whether the fire burned an area of forest:
- class is equal to 0 if area = 0.00 ha
- class is equal to 1 if area > 0.00 ha
Our goal will be to predict the *class* using logistic regression, given the weather and FWI features.
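The CSV loaded below is assumed to already contain this class column. Purely as an illustration of the rule above (on a toy DataFrame, not the real file), it could be derived from area like this:

```python
import pandas as pd

# Toy stand-in for the forest fires data
demo = pd.DataFrame({"area": [0.00, 0.36, 0.00, 10.73]})
# class = 1 where some forest area burned, 0 otherwise
demo["class"] = (demo["area"] > 0).astype(int)
print(demo["class"].tolist())  # [0, 1, 0, 1]
```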
# Load data
forest_fire = pd.read_csv("forestfires.csv")
Discover your dataset¶
- Explore your dataset, displaying a few observations, the types of your data, some summary statistics, and the correlation matrix. Feel free to push forward your EDA using a few graphs e.g., boxplot and pairplot.
# YOUR CODE HERE
# Display a sample of the data
display(forest_fire.head())
# Display the data types
display(forest_fire.dtypes)
| X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.0 | 0 |
| 1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.0 | 0 |
| 2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.0 | 0 |
| 3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.0 | 0 |
| 4 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0.0 | 0 |
X           int64
Y           int64
month      object
day        object
FFMC      float64
DMC       float64
DC        float64
ISI       float64
temp      float64
RH          int64
wind      float64
rain      float64
area      float64
class       int64
dtype: object
# Summary statistics
display(forest_fire.describe())
# Correlation matrix visualized in heat map
sns.heatmap(forest_fire.corr(numeric_only = True).round(decimals=1), annot=True, cmap="bwr")
plt.title('Correlation matrix')
plt.show()
| X | Y | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 517.000000 | 517.000000 | 517.000000 | 517.000000 | 517.000000 | 517.000000 | 517.000000 | 517.000000 | 517.000000 | 517.000000 | 517.000000 | 517.000000 |
| mean | 4.669246 | 4.299807 | 90.644681 | 110.872340 | 547.940039 | 9.021663 | 18.889168 | 44.288201 | 4.017602 | 0.021663 | 12.847292 | 0.522244 |
| std | 2.313778 | 1.229900 | 5.520111 | 64.046482 | 248.066192 | 4.559477 | 5.806625 | 16.317469 | 1.791653 | 0.295959 | 63.655818 | 0.499989 |
| min | 1.000000 | 2.000000 | 18.700000 | 1.100000 | 7.900000 | 0.000000 | 2.200000 | 15.000000 | 0.400000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 3.000000 | 4.000000 | 90.200000 | 68.600000 | 437.700000 | 6.500000 | 15.500000 | 33.000000 | 2.700000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 4.000000 | 4.000000 | 91.600000 | 108.300000 | 664.200000 | 8.400000 | 19.300000 | 42.000000 | 4.000000 | 0.000000 | 0.520000 | 1.000000 |
| 75% | 7.000000 | 5.000000 | 92.900000 | 142.400000 | 713.900000 | 10.800000 | 22.800000 | 53.000000 | 4.900000 | 0.000000 | 6.570000 | 1.000000 |
| max | 9.000000 | 9.000000 | 96.200000 | 291.300000 | 860.600000 | 56.100000 | 33.300000 | 100.000000 | 9.400000 | 6.400000 | 1090.840000 | 1.000000 |
We make box plots of the weather and fire index features, i.e., the features we will use to predict the class:
forest_fire[['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']].plot(
kind='box',
subplots=True,
sharey=False,
figsize=(15, 5),
title = 'Box plot of weather and fire index features'
)
plt.subplots_adjust(wspace=1) # increase spacing between subplots
plt.show()
Pair plot of the same features, colored by our class to visualize if we can already identify some interesting relations:
df_ff = forest_fire[['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain', 'class']]
sns.pairplot(df_ff,
hue = 'class',
palette = 'deep')
plt.show()
From our EDA, we notice that our target variable ('class') is only weakly correlated with the features. The pair plot provides more details: we do not see any clusters, i.e., any obvious decision boundaries. We'll see whether a simple classification model like logistic regression can obtain good accuracy on this task and dataset...
Multi-features logistic regression¶
We'll start with only four features: the temperature, the rain, the FFMC, and the wind.
- Define your features and target variable ('class'):
# YOUR CODE HERE
X_forest = forest_fire[['temp','rain','FFMC', 'wind']]
y_forest = forest_fire[['class']]
- Split your data into a training and a test set (use 20% of the data as the test set):
# YOUR CODE HERE
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X_forest, y_forest,
test_size=0.2,
random_state=9,
shuffle=True)
- Rescale your data, using the scaler of your choice:
# YOUR CODE HERE
# Define the scaler
scaler_f = MinMaxScaler()
# Fit and transform training set
X_train_f = scaler_f.fit_transform(X_train_f)
# Transform test set
X_test_f = scaler_f.transform(X_test_f)
- Build and train a simple logistic regression classifier:
# YOUR CODE HERE
# Set up our model
model_f = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
# Fit our model
model_f.fit(X_train_f, y_train_f.values.flatten()) # We extract values and flatten y_train_f to obtain an array as input instead of data frame
LogisticRegression(max_iter=1000)
- Compare the training and testing accuracy of your model:
# YOUR CODE HERE
# Accuracy on the test set
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
.format(model_f.score(X_test_f, y_test_f)))
# Accuracy on the training set
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
.format(model_f.score(X_train_f, y_train_f)))
Accuracy of Logistic regression classifier on test set: 0.52
Accuracy of Logistic regression classifier on training set: 0.55
The accuracy seems low, which was expected following our EDA. Let's try to look deeper and gather more information to better evaluate our classifier.
- Plot the distribution of class
# YOUR CODE HERE
y_forest.value_counts().plot.bar(color=['purple', 'blue'], grid=False)
plt.ylabel('Number of observations')
plt.title('Number of observations of each class in the forest fires dataset');
- Compute the default rate and compare it to the accuracy of your model. What do you think?
# YOUR CODE HERE
# Compute the default rate
fire_0 = forest_fire.loc[forest_fire["class"] == 0].shape[0]
print('# occurrence of class 0: ', fire_0)
fire_1 = forest_fire.loc[forest_fire["class"] == 1].shape[0]
print('# occurrence of class 1: ', fire_1)
defaultrate_f = max(fire_0, fire_1)/(forest_fire["class"].shape[0])
print(f'Default rate = {defaultrate_f:0.4f}')
# occurrence of class 0:  247
# occurrence of class 1:  270
Default rate = 0.5222
The default rate is almost equal to our accuracy, so our classifier does not outperform a naive classifier that would always predict class 1.
- Plot the confusion matrix
# YOUR CODE HERE
# Predict on test set
y_pred_f = model_f.predict(X_test_f)
# Heatmap of confusion matrix
sns.heatmap(confusion_matrix(y_test_f, y_pred_f), annot=True, cmap='Blues', fmt='.4g')
plt.xlabel('Predicted label')
plt.ylabel('True labels')
plt.title('Confusion Matrix');
In our case, our classifier predicts class 1 most of the time. In other words, it has a high recall but a low precision:
print('The precision for class 1 is: {:0.3f}'.format(precision_score(y_test_f, y_pred_f)))
print('The recall for class 1 is: {:0.3f}'.format(recall_score(y_test_f, y_pred_f)))
print('The F1 score for class 1 is: {:0.3f}'.format(f1_score(y_test_f, y_pred_f)))
The precision for class 1 is: 0.518
The recall for class 1 is: 0.811
The F1 score for class 1 is: 0.632
Class 1 indicates that forest areas were burned by fire. If our objective were to implement preventive measures against fire, we would favor high recall over high precision. So our model is slightly better than a naive classifier, but it still does not deliver accurate predictions. This task calls for a more complex model.
(Deep) Neural Networks (SOLUTION)¶
The easiest option is to work on this notebook in Google Colab, where you don't need to install TensorFlow on your laptop.
Google Colab: https://colab.research.google.com/notebooks/welcome.ipynb#recent=true
Go to "upload" to open this notebook in Google Colab.
If you would like to install TensorFlow on your local machine, run the following code in a notebook cell: !pip install tensorflow
Tensorflow/Keras Cheat Sheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf¶
Neural Net with Fashion MNIST¶
Our task is to classify images of clothing into 10 classes using a neural network model. The goal is to get familiar with neural networks applied on real-world datasets.
# load required packages
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, Dropout, MaxPooling2D, Activation, BatchNormalization
import numpy as np
import matplotlib.pyplot as plt
print(tf.__version__) #version should be at least 1.15.x
2.4.1
Dataset¶
We use the Fashion MNIST dataset by Zalando, which contains 70,000 grayscale images, each assigned to one of 10 clothing categories (e.g., Top, Trouser, Sneaker, ...). The images show individual clothing articles at low resolution (28 by 28 pixels). The dataset is available as a TensorFlow dataset.
We take the split that comes with the dataset where 60,000 images are used to train the network and 10,000 images to evaluate its prediction performance.
# load the training and test data
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
# names of class labels (we have ten classes)
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
Put the datasets in the necessary shapes (dimensionality) and check this.
train_images = train_images.reshape((len(train_images),28,28))
test_images = test_images.reshape((len(test_images),28,28))
# check the shapes of the training and test data
print("shape for training (x) data : ", train_images.shape) # should be: 60'000 Images each with 28x28 pixels
print("shape for training (y) data : ", train_labels.shape) # 60'000 Labels with 10 classes
print("shape for test (x) data : ", test_images.shape) # 10'000 Images with 28x28 pixels
print("shape for test (y) data : ", test_labels.shape) # 10'000 Labels with 10 classes
shape for training (x) data :  (60000, 28, 28)
shape for training (y) data :  (60000,)
shape for test (x) data :  (10000, 28, 28)
shape for test (y) data :  (10000,)
# to give you an overview of the data plot first 25 images with corresponding labels
plt.figure(figsize=(10,10))
for i in range(25):
plt.subplot(5,5,i+1)
plt.xticks([])
plt.yticks([])
plt.grid(False)
plt.imshow(train_images[i], cmap=plt.cm.binary)
plt.xlabel(class_names[train_labels[i]])
plt.show()
Design the Deep Neural Network¶
The goal here is to design a convolutional neural network for classifying clothing images into the ten clothing categories. The first and the last layer of the network are given. It is your choice how deep or complex you want to build your neural network.
Here are some possible layers you could use:
- `Flatten()` https://keras.io/api/layers/reshaping_layers/flatten/
- `Dense()` (specify an activation function, i.e., use the argument `activation=...`) https://keras.io/api/layers/core_layers/dense/
- `Dropout()` https://keras.io/api/layers/regularization_layers/dropout/
- `BatchNormalization()` https://keras.io/api/layers/normalization_layers/batch_normalization/
- `Conv2D()` (specify an activation function, i.e., use the argument `activation=...`) https://keras.io/api/layers/convolution_layers/convolution2d/
- `MaxPooling2D()` https://keras.io/api/layers/pooling_layers/max_pooling2d/
Hint: Note that to use Conv2D layers we need to reshape the training and test images to (60000, 28, 28, 1) and (10000, 28, 28, 1), respectively!
# you can also run this model without adding layers, to see how the most simple model performs
model = Sequential()
#model.add(Flatten(input_shape=(28, 28))) # remove this layer if you want to use convolutional neural networks
# data reshaped for Convolution2D
train_images=train_images.reshape(60000,28,28,1)
test_images=test_images.reshape(10000,28,28,1)
#### DESIGN YOUR NETWORK HERE ####
model.add(Conv2D(filters = 32, kernel_size=(3,3), strides =1, padding='same', input_shape= (28,28,1), activation="relu"))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.3))
model.add(Conv2D(filters = 64, kernel_size=(4,4), strides =1, padding='same',activation="relu"))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.4))
model.add(Conv2D(filters = 128, kernel_size=(5,5), strides =1, padding='same',activation="relu"))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
#### END YOUR NETWORK DESIGN HERE ####
# final output softmax layer for 10 classes (do not modify this layer)
model.add(Dense(10, activation = 'softmax'))
# print a summary of your model
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_3 (Conv2D)            (None, 28, 28, 32)        320
_________________________________________________________________
batch_normalization_3 (Batch (None, 28, 28, 32)        128
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 32)        0
_________________________________________________________________
dropout_3 (Dropout)          (None, 14, 14, 32)        0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 14, 14, 64)        32832
_________________________________________________________________
batch_normalization_4 (Batch (None, 14, 14, 64)        256
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 7, 7, 64)          0
_________________________________________________________________
dropout_4 (Dropout)          (None, 7, 7, 64)          0
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 7, 7, 128)         204928
_________________________________________________________________
batch_normalization_5 (Batch (None, 7, 7, 128)         512
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 3, 3, 128)         0
_________________________________________________________________
dropout_5 (Dropout)          (None, 3, 3, 128)         0
_________________________________________________________________
flatten_1 (Flatten)          (None, 1152)              0
_________________________________________________________________
dense_2 (Dense)              (None, 128)               147584
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1290
=================================================================
Total params: 387,850
Trainable params: 387,402
Non-trainable params: 448
_________________________________________________________________
Train the model¶
Compile and train the model you designed.
You can also adapt the batch size batch_size, the number of epochs epochs, and the optimizer used in the training process.
# compile model
# you can modify the parameter optimizer with different optimization methods
model.compile(optimizer = 'adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
# specify batch size for the optimization (mini-batch)
batch_size = 128
# specify the number of epochs
epochs = 10
# fit model to train data
# we use 10% of the data as validation set
# you can add more 'iterations' by raising the parameter epochs
log = model.fit(train_images,
                train_labels,
                batch_size=batch_size,
                epochs=epochs,
                validation_split=0.1)
# plot accuracy per epoch
plt.plot(log.history['accuracy'], label='Training Accuracy')
plt.plot(log.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.grid()
Epoch 1/10
1875/1875 [==============================] - 9s 4ms/step - loss: 0.7303 - accuracy: 0.7470 - val_loss: 0.3498 - val_accuracy: 0.8718
Epoch 2/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3548 - accuracy: 0.8673 - val_loss: 0.3284 - val_accuracy: 0.8813
Epoch 3/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3125 - accuracy: 0.8840 - val_loss: 0.2836 - val_accuracy: 0.8962
Epoch 4/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2854 - accuracy: 0.8944 - val_loss: 0.2748 - val_accuracy: 0.8997
Epoch 5/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2677 - accuracy: 0.9016 - val_loss: 0.2490 - val_accuracy: 0.9103
Epoch 6/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2533 - accuracy: 0.9052 - val_loss: 0.2530 - val_accuracy: 0.9086
Epoch 7/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2460 - accuracy: 0.9084 - val_loss: 0.2372 - val_accuracy: 0.9139
Epoch 8/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2356 - accuracy: 0.9122 - val_loss: 0.2283 - val_accuracy: 0.9164
Epoch 9/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2304 - accuracy: 0.9144 - val_loss: 0.2425 - val_accuracy: 0.9132
Epoch 10/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2212 - accuracy: 0.9163 - val_loss: 0.2286 - val_accuracy: 0.9187
Evaluate the model on the test dataset¶
# compute the loss and accuracy
test_scores = model.evaluate(test_images, test_labels, verbose=0)
print("Test loss:", test_scores[0])
print("Test accuracy:", test_scores[1])
# you can also make predictions for the test data
predictions = model.predict(test_images)
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# get the maximum probability class
max_probability_predictions = np.argmax(predictions, axis=1)
# compute and display the confusion matrix
conf_mat = confusion_matrix(test_labels, max_probability_predictions)
conf_mat = ConfusionMatrixDisplay(conf_mat)
conf_mat.plot()
The cell below visualizes some example images from the test set along with their true labels and the model predictions.
# This function plots the first n predictions with their true label (in brackets) and image
test_images=test_images.reshape(10000,28,28)
def plot_image(i, predictions_array, true_label, img):
predictions_array, true_label, img = predictions_array, true_label[i], img[i]
plt.xticks([])
plt.yticks([])
plt.imshow(img, cmap=plt.cm.binary)
predicted_label = np.argmax(predictions_array)
if predicted_label == true_label:
color = 'blue'
else:
color = 'red'
plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
100*np.max(predictions_array),
class_names[true_label]),
color=color)
# plot prediction and image
num_rows = 7
num_cols = 5
num_images = num_rows*num_cols
plt.figure(figsize=(2*num_cols, 1*num_rows))
for i in range(num_images):
plt.subplot(num_rows, num_cols, i+1)
plot_image(i, predictions[i], test_labels, test_images)
plt.tight_layout()
plt.show()
Lab 7, A1 Prediction of City Bike Usage¶
First, we define some default parameters for nice plots
%matplotlib inline
from IPython.display import set_matplotlib_formats, display
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from cycler import cycler
set_matplotlib_formats('pdf', 'png')
plt.rcParams['figure.dpi'] = 300
plt.rcParams['image.cmap'] = "viridis"
plt.rcParams['image.interpolation'] = "none"
plt.rcParams['savefig.bbox'] = "tight"
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['legend.numpoints'] = 1
plt.rcParams['xtick.labelsize']=6
plt.rcParams['ytick.labelsize']=6
np.set_printoptions(precision=3, suppress=True)
(a) Reading the City Bike dataset from a pickle file¶
citibike=pd.read_pickle('CitibikeDataSet.pkl')
print("Citibike data:\n{}".format(citibike.head()))
plt.figure(figsize=(12, 5))
xticks = pd.date_range(start=citibike.index.min(), end=citibike.index.max(),
freq='D')
plt.xticks(xticks.astype("int"), xticks.strftime("%a-%m-%d"), rotation=90, ha="left")
plt.plot(citibike.index.astype("int"),citibike, linewidth=2)
plt.xlabel("Date")
plt.ylabel("Rentals")
plt.grid(True)
Citibike data:
starttime
2015-08-01 00:00:00     3
2015-08-01 03:00:00     0
2015-08-01 06:00:00     9
2015-08-01 09:00:00    41
2015-08-01 12:00:00    39
Freq: 3H, Name: one, dtype: int64
In the cell below, the time stamp citibike.index is converted into seconds since 1970 (Unix or POSIX time, cf. https://en.wikipedia.org/wiki/Unix_time).
# extract the target values (number of rentals)
y = citibike.values
# convert to POSIX time by dividing by 10**9
X = citibike.index.astype("int64").values.reshape(-1, 1) // 10**9
(b) The eval_on_features helper function¶
This function splits the data into a training and a test set, fits the response y as a function of the features with a given regressor, plots the predictions on both the training and the test set, and prints the test-set R^2 score to the console.
# use the first 184 data points for training, the rest for testing
n_train = 184
# function to evaluate and plot a regressor on a given feature set
def eval_on_features(features, target, regressor):
# split the given features into a training and a test set
X_train, X_test = features[:n_train], features[n_train:]
# also split the target array
y_train, y_test = target[:n_train], target[n_train:]
regressor.fit(X_train, y_train)
print("Test-set R^2: {:.2f}".format(regressor.score(X_test, y_test)))
y_pred = regressor.predict(X_test)
y_pred_train = regressor.predict(X_train)
plt.figure(figsize=(10, 3))
plt.xticks(range(0, len(X), 8), xticks.strftime("%a %m-%d"), rotation=90,
ha="left")
plt.plot(range(n_train), y_train, label="train")
plt.plot(range(n_train, len(y_test) + n_train), y_test, '-', label="test")
plt.plot(range(n_train), y_pred_train, '--', label="prediction train")
plt.plot(range(n_train, len(y_test) + n_train), y_pred, '--',
label="prediction test")
plt.legend(loc=(1.01, 0))
plt.xlabel("Date")
plt.ylabel("Rentals")
(c) Ensemble Tree Regression on time stamp in seconds¶
As a first model, use the time in seconds X as input and the number of rentals within each three-hour interval as the response y. Use the RandomForestRegressor from sklearn.ensemble with 100 trees as the regressor. Train the model and plot its predictions using the eval_on_features function.
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
eval_on_features(X, y, regressor)
Test-set R^2: -0.04
The predictions on the training data are very good, as is typical for random forests. For the test data, however, a constant line is predicted, and the R^2 value is -0.04: the model has learned essentially nothing. What happened?
The problem lies in the combination of this feature with a random forest. The POSIX time values of the test data lie outside the range of the training data: all test points have later time stamps than all training points. Trees, and therefore random forests, cannot extrapolate to feature values outside the training range. As a result, the model simply predicts the target value of the last training point, the latest point in time for which it has any data at all.
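This failure mode is easy to reproduce in isolation. A minimal sketch on toy data (not the Citibike set): a random forest trained on x values in [0, 9] predicts the same constant for every x beyond the training range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on x = 0..9 with a simple linear trend
X_demo = np.arange(10).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel()
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# x = 15 and x = 100 fall into the same leaf as the last training point,
# so all three predictions are identical: the trees cannot extrapolate
print(rf.predict([[9], [15], [100]]))
```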
(d) Use the daytime in hours as an input for the model:¶
Of course we can improve the model. This is where our "expert knowledge" comes into play. Looking at the rental figures in the training data, we see that two factors are decisive: the time of day and the day of the week. The POSIX time itself is of little use, so we discard it. First we try the time of day alone.
X_hour = citibike.index.hour.values.reshape(-1, 1)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
eval_on_features(X_hour, y, regressor)
Test-set R^2: 0.60
The R2 value is already much better, but the predictions do not reflect the weekly rhythm. So we also add the day of the week.
(e) Regression on weekday and hour of the day¶
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
X_hour_week = np.hstack([citibike.index.dayofweek.values.reshape(-1, 1),
citibike.index.hour.values.reshape(-1, 1)])
eval_on_features(X_hour_week, y, regressor)
Test-set R^2: 0.84
Now we have a model that captures the periodic behavior by considering weekday and time of day. It has an R^2 value of 0.84 and quite good prediction quality. Most likely, this model simply learns the average number of rentals for each combination of weekday and time of day from the first 23 days of August. This does not actually require a complex model like a random forest, so we try a simpler model, LinearRegression:
(f) Linear Regression on weekday and hour¶
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
eval_on_features(X_hour_week, y, LinearRegression())
Test-set R^2: 0.13
Result: LinearRegression works much worse, and the periodicity looks odd. The reason is that we encoded the day of the week and the time of day as integers, which are interpreted as continuous variables. The linear model can therefore only learn a linear function of the time of day, and it has learned that there are more rentals at later times of the day. The real patterns, however, are far more complex. We can capture them by treating the integers as categorical variables, using the OneHotEncoder transform.
(g) One-hot-Encoding of weekday and hour¶
Now we apply the one-hot-encoding to the day and week.
from sklearn.preprocessing import OneHotEncoder
# transform using the OneHotEncoder
enc = OneHotEncoder()
X_hour_week_onehot = enc.fit_transform(X_hour_week).toarray()
eval_on_features(X_hour_week_onehot, y, LinearRegression())
Test-set R^2: 0.61
print(X_hour_week_onehot.shape)
(248, 15)
We get a much better fit than with the continuously coded features. Now the linear model learns one coefficient per weekday and one per time of day. This means, however, that the "time of day" pattern is shared across all days of the week.
(h) Polynomial Feature Generation¶
With interaction features, we can add a coefficient for each combination of day of the week and time of day to the model.
from sklearn.preprocessing import PolynomialFeatures
poly_transformer = PolynomialFeatures(degree=2, interaction_only=True,
include_bias=False)
X_hour_week_onehot_poly = poly_transformer.fit_transform(X_hour_week_onehot)
eval_on_features(X_hour_week_onehot_poly, y, Ridge())
Test-set R^2: 0.85
print(X_hour_week_onehot_poly.shape)
(248, 120)
With this transformation, we finally get a model that performs as well as the random forest. A big advantage of this model is that it is very clear what it has learned: one coefficient for every day and every time of day. We can simply plot the coefficients determined by the model, which would not be possible for a random forest.
Lab 7, A2: Feature Selection on the Wisconsin Breast Cancer Dataset¶
%matplotlib inline
#from preamble import *
from IPython.display import set_matplotlib_formats, display
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
set_matplotlib_formats('pdf', 'png')
plt.rcParams['figure.dpi'] = 300
plt.rcParams['image.cmap'] = "viridis"
plt.rcParams['image.interpolation'] = "none"
plt.rcParams['savefig.bbox'] = "tight"
plt.rcParams['lines.linewidth'] = 2
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
print(cancer.target_names)
print(cancer.feature_names)
['malignant' 'benign'] ['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry' 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error' 'area error' 'smoothness error' 'compactness error' 'concavity error' 'concave points error' 'symmetry error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst smoothness' 'worst compactness' 'worst concavity' 'worst concave points' 'worst symmetry' 'worst fractal dimension']
(a) Plotting Histograms¶
fig, axes = plt.subplots(15, 2, figsize=(10, 20))
malignant = cancer.data[cancer.target == 0]
benign = cancer.data[cancer.target == 1]
ax = axes.ravel()
for i in range(30):
_, bins = np.histogram(cancer.data[:, i], bins=50)
ax[i].hist(malignant[:, i], bins=bins, color='b', alpha=.5)
ax[i].hist(benign[:, i], bins=bins, color='g', alpha=.5)
ax[i].set_title(cancer.feature_names[i])
ax[i].set_yticks(())
ax[0].set_xlabel("Feature magnitude")
(b) Automatic Feature Selection: Univariate statistics (F-test, ANOVA)¶
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, random_state=0, test_size=.5)
# use f_classif (the default) and SelectPercentile to select 50% of features
select = SelectPercentile(percentile=50)
select.fit(X_train, y_train)
# transform training set
X_train_selected = select.transform(X_train)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_selected.shape: {}".format(X_train_selected.shape))
X_train.shape: (284, 30)
X_train_selected.shape: (284, 15)
mask = select.get_support()
print(mask)
# visualize the mask. black is True, white is False
plt.matshow(mask.reshape(1, -1), cmap='gray_r')
plt.xlabel("Feature index")
plt.yticks(())
[ True False True True False True True True False False True False True True False False False False False False True False True True False True True True False False]
# visualize the univariate F-scores of all features
plt.figure(figsize=(15, 10))
n_features = cancer.data.shape[1]
plt.title("Feature Score")
plt.xticks(np.arange(n_features), cancer.feature_names, rotation=90, fontsize=16)
plt.ylabel("F-score")
plt.bar(range(n_features), select.scores_)
mask.reshape(1, -1)
array([[ True, False, True, True, False, True, True, True, False,
False, True, False, True, True, False, False, False, False,
False, False, True, False, True, True, False, True, True,
True, False, False]])
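The boolean mask can also be mapped back to feature names to see exactly which features survived the 50% cut. A small sketch reusing the split above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0, test_size=.5)

# Refit the selector and translate the boolean mask into feature names
select = SelectPercentile(percentile=50).fit(X_train, y_train)
selected_names = cancer.feature_names[select.get_support()]
print(selected_names)
```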
Let's now fit a linear model (logistic regression) on the reduced input features and check its accuracy.
from sklearn.linear_model import LogisticRegression
# transform test data
X_test_selected = select.transform(X_test)
lr = LogisticRegression(max_iter=5000)
lr.fit(X_train, y_train)
print("Score with all features: {:.3f}".format(lr.score(X_test, y_test)))
lr.fit(X_train_selected, y_train)
print("Score with only selected features: {:.3f}".format(
lr.score(X_test_selected, y_test)))
Score with all features: 0.958
Score with only selected features: 0.358
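The sharp drop may partly come from fitting logistic regression on unscaled inputs. As a hedged sketch (the step names below are our own, not part of the lab), wrapping scaling, selection, and the classifier in a single `Pipeline` keeps the steps consistent and typically recovers most of the accuracy:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0, test_size=.5)

# Scale, select 50% of the features, then classify -- all in one pipeline
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectPercentile(percentile=50)),
    ('clf', LogisticRegression(max_iter=5000)),
])
pipe.fit(X_train, y_train)
print("Pipeline test score: {:.3f}".format(pipe.score(X_test, y_test)))
```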
(c) Automatic Feature Selection: Model-based Feature Selection using a Decision Tree Classifier¶
A tree-based classifier reports relative feature importances. These scores can be used to select the most important features.
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
print("Feature importances:\n{}".format(tree.feature_importances_))
def plot_feature_importances_cancer(model):
    plt.figure(figsize=(15, 10))
    n_features = cancer.data.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), cancer.feature_names, fontsize=12)
    plt.xticks(fontsize=12)
    plt.xlabel("Feature importance", fontsize=16)
    plt.ylabel("Feature", fontsize=16)
    plt.ylim(-1, n_features)
plot_feature_importances_cancer(tree)
Accuracy on training set: 0.982
Accuracy on test set: 0.926
Feature importances:
[0. 0.02355646 0. 0. 0. 0. 0.02214307 0. 0. 0.02775379 0.00513077 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.11199707 0. 0. 0. 0. 0.80941884 0. 0. ]
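To read the importance vector more easily, the features can be ranked with `np.argsort` (a small sketch refitting the same tree on the same split):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0, test_size=.5)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Rank features by importance, highest first
order = np.argsort(tree.feature_importances_)[::-1]
top3 = cancer.feature_names[order[:3]]
print(top3)
```

Note that the importances always sum to one, and a depth-4 tree can use at most 15 distinct split features.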
(d) Automatic Feature selection: Model-based Feature Selection using Random Forest Classifier¶
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
select = SelectFromModel(
RandomForestClassifier(n_estimators=100, random_state=42),
threshold="median")
select.fit(X_train, y_train)
X_train_l1 = select.transform(X_train)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_l1.shape: {}".format(X_train_l1.shape))
X_train.shape: (284, 30)
X_train_l1.shape: (284, 15)
mask = select.get_support()
# visualize the mask. black is True, white is False
plt.matshow(mask.reshape(1, -1), cmap='gray_r')
plt.xlabel("Feature index")
plt.yticks(())
X_test_l1 = select.transform(X_test)
score = LogisticRegression(solver='lbfgs',max_iter=1000).fit(X_train_l1, y_train).score(X_test_l1, y_test)
print("Test score: {:.3f}".format(score))
Test score: 0.954
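A related model-based strategy is recursive feature elimination (`RFE`), which repeatedly drops the weakest feature instead of thresholding once. A sketch under the same split (this is our addition, not part of the lab):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0, test_size=.5)

# Iteratively drop the least important feature until 15 remain
select = RFE(RandomForestClassifier(n_estimators=100, random_state=42),
             n_features_to_select=15)
select.fit(X_train, y_train)

X_train_rfe = select.transform(X_train)
X_test_rfe = select.transform(X_test)
score = LogisticRegression(max_iter=1000).fit(
    X_train_rfe, y_train).score(X_test_rfe, y_test)
print("Test score: {:.3f}".format(score))
```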
Lab 7, A3: Yelp Business Classification using Bag-of-Words and tf-idf¶
Yelp is a crowd-sourced review forum, as well as an American multinational corporation headquartered in San Francisco, California. It develops, hosts and markets Yelp.com and the Yelp mobile app, which publish crowd-sourced reviews about local businesses, as well as the online reservation service Yelp Reservations. The company also trains small businesses in how to respond to reviews, hosts social events for reviewers, and provides data about businesses, including health inspection scores. The data is open and can be downloaded here https://www.yelp.com/dataset/challenge.
We have prepared a partial dataset from the big Yelp dataset in the pickle serialization format that you can read into a pandas dataframe using pd.read_pickle.
import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction import text
from sklearn.linear_model import LogisticRegression
import sklearn.model_selection as modsel
import sklearn.preprocessing as preproc
(a) Load and prep Yelp reviews data¶
DataPath = 'D:/Downloads/YelpDataset/'
nightlife_subset = pd.read_pickle(DataPath + 'nightlife_subset.pkl')
restaurant_subset = pd.read_pickle(DataPath + 'restaurant_subset.pkl')
(b) combine both datasets¶
combined = pd.concat([nightlife_subset, restaurant_subset])
combined['target'] = combined.apply(lambda x: 'Nightlife' in x['categories'],
axis=1)
combined
| business_id | name | stars_y | text | categories | target | |
|---|---|---|---|---|---|---|
| 2203299 | lpYFsXFrojiBZ1kbWR2lZw | Four Peaks Grill & Tap | 5 | Great service and food... Enjoy the atmosphere... | Food, Restaurants, American (New), Local Flavo... | True |
| 482774 | KskYqH1Bi7Z_61pH6Om8pg | Lotus of Siam | 5 | Lotus is one of my all time favorites in Las V... | Wine Bars, Nightlife, Restaurants, Seafood, Ca... | True |
| 3086879 | zFnPRtP7LGvr3sfxvy_dfg | Revolution Ale House | 1 | This place has definitely gone downhill. We fi... | Nightlife, Italian, Restaurants, Pizza, Bars | True |
| 512115 | gUR2pWQKLPgMEm_R_aI_aw | Shooters On The Water | 3 | Thought this might be a worn out hang out for ... | Dive Bars, American (New), American (Tradition... | True |
| 3278073 | hmYnzs8-aHbltaOOGDgmbA | Zipps Sports Grill | 4 | Zipps is almost always on happy hour! They hav... | Sports Bars, American (Traditional), Nightlife... | True |
| 717572 | aHACz8VHbBV5Je6q9u7g0Q | Burgers 2 Beer - Concord Twp. | 1 | The Concord Burgers to Beer NEEDS A NEW MANAGE... | Gastropubs, Bars, Nightlife, Restaurants, Beer... | True |
| 590714 | bXLaGCKzkQcA2hLT-JZQ1w | J. Sams Wine Bar | 5 | I saw the good reviews on here and wanted to c... | Restaurants, Arts & Entertainment, American (N... | True |
| 2165889 | 7sPNbCx7vGAaH7SbNPZ6oA | Bachi Burger | 1 | So it's been about 4 weeks since I was told th... | Restaurants, Food, Asian Fusion, American (New... | True |
| 621638 | DublKfLa9Y0PguCryoDJ-Q | Cantina 1511 | 5 | Great food & service! Stan the manager was the... | New Mexican Cuisine, Bars, Nightlife, Mexican,... | True |
| 2048614 | iH6heFdMwPXk9GIO6PwUvA | Beerhaus | 5 | We had fun at happy hour. Truth or dare Jenga ... | Breweries, Beer Bar, Bars, American (New), Foo... | True |
| 12861 | wKlH90YB5RYFvJ8N3pstVw | Union Standard | 5 | I go to Union Standard's happy hour from 4-6 a... | Seafood, Cocktail Bars, Nightlife, Restaurants... | True |
| 1064434 | B7pLK62P0rRxz25HV4RXFA | Victory Alley | 4 | I actually don't want to review this place and... | Bars, Burgers, Sports Bars, Chicken Wings, Nig... | True |
| 2164784 | 7sPNbCx7vGAaH7SbNPZ6oA | Bachi Burger | 4 | Went here because of Diners, Drive-Ins and Div... | Restaurants, Food, Asian Fusion, American (New... | True |
| 1315939 | rZOzhSA5HP6IdpxuN4v66w | Nikko Japanese Restaurant & Sushi Bar | 5 | I absolutely love the Sushi here. The rolls ar... | Party & Event Planning, Japanese, Dance Clubs,... | True |
| 3249081 | foQZ6guS0l49trURKh7vlA | Double Barrel Roadhouse | 4 | The best things about this spot, are the atmos... | Cocktail Bars, Bars, Nightlife, American (Trad... | True |
| 501133 | vjlnj2qGXOQrLVLaCCF_mw | Desert Rose Pizza & Gastropub | 5 | Everything was wonderful, service, food, cockt... | Sports Bars, Tobacco Shops, Nightlife, Restaur... | True |
| 3101497 | 90bL34o2KEes9pUnCOm7pQ | The Gladly | 5 | The Gladly is located on the SE corner of Came... | Bars, American (New), Venues & Event Spaces, N... | True |
| 1862174 | DYuOxkW4DtlJsTHdxdXSlg | Bahama Breeze | 2 | Maybe we got the wrong dishes. I got the chipo... | Laotian, Bars, Nightlife, Restaurants, Seafood... | True |
| 1641551 | jQJYvzUFsz9ytI1AzW0dyQ | Applebee's Grill + Bar | 1 | Walked in and no one greeting us, there was a ... | Steakhouses, Bars, Sports Bars, Nightlife, Ame... | True |
| 3431743 | pH0BLkL4cbxKzu471VZnuA | SUSHISAMBA - Las Vegas | 2 | A lot of experience was dampened by the servic... | Chinese, Japanese, Nightlife, Dim Sum, Asian F... | True |
| 1972070 | P7pxQFqr7yBKMMI2J51udw | Holsteins Shakes & Buns | 5 | I'm posting this review on behalf of my husban... | American (Traditional), Restaurants, Burgers, ... | True |
| 1930383 | VsPoQeCRYYHQrj9jbiLmtA | Russell's Pub N Grill | 2 | Alright folks- this place could've been better... | Restaurants, Nightlife, American (New), Bars, ... | True |
| 1383937 | l_kefVF1frmC0xRW2YkvUA | Whisky River | 3 | Came in here with my sister. I'm in town for a... | Nightlife, Music Venues, Cocktail Bars, Venues... | True |
| 2471262 | GZfz7YiV1fUHjsBj_8ytZA | Anthony's Sports Bar Restaurant & Catering | 4 | We went to this bar for the sole reason to wat... | Event Planning & Services, Nightlife, Bars, Ca... | True |
| 1099305 | 2skQeu3C36VCiB653MIfrw | Bootleggers Modern American Smokehouse | 2 | Not what we expected and not worth the money. ... | Barbeque, Restaurants, Nightlife, Bars, Food, ... | True |
| 1774310 | xy1McNUocWlt-8DZ7Ifg9A | Pravda Vodka Bar | 3 | Saturday night\nI went to Pravda for my friend... | Bars, Nightlife, Lounges | True |
| 3240924 | iIok1p4qnpGAa07xoaXRQA | Field Table | 4 | Great service, vibe, and eats. I went for brun... | Restaurants, Nightlife, Cocktail Bars, Bars, C... | True |
| 2576812 | _T7f2wUgNlJqxsR-cR89SQ | Scaramouche Restaurant Pasta Bar & Grill | 5 | Hands down the best restaurant in Toronto. Ha... | Restaurants, Bars, French, Nightlife | True |
| 2331196 | 8tKhimgRiNx74LbYDu9LIw | Greek Town | 3 | Close to a 4 star but...prawns were previously... | Wine Bars, Restaurants, Greek, Bars, Nightlife | True |
| 898173 | pHpU8lnnxMuPWRHOysuMIQ | Salut Kitchen Bar | 5 | In searching for an ideal venue for a group ce... | Bars, Nightlife, Beer, Wine & Spirits, Restaur... | True |
| ... | ... | ... | ... | ... | ... | ... |
| 510186 | tvYID0arhN-shKGUrC1Wsg | Copacabana Brazilian Steak House | 4 | Meat on meat on meat!\n\nIt was my second time... | Beer, Wine & Spirits, Restaurants, Brazilian, ... | False |
| 229587 | QvltB7RjVOVRBl685azJ5g | Burntwood Tavern | 1 | We just left, we have been there many times bu... | American (Traditional), Restaurants | False |
| 2011253 | fcyk-PZKRqo4EUJ0vH1aNg | Capps Pizza | 1 | Called and ordered a pizza at 7pm and it still... | Pizza, Chicken Wings, Restaurants, Burgers | False |
| 1039515 | 6ZIHxvFTHC1pvAzAS0uLDA | Lee's Sandwiches | 4 | Try the bbq pork sandwich... yum!! Gotta get e... | Vietnamese, Sandwiches, Food, Ice Cream & Froz... | False |
| 2066597 | spDZkD6cp0JUUm6ghIWHzA | Kitchen M | 1 | Would give this place 0 stars if that were pos... | Restaurants, Chinese | False |
| 1733536 | GGecutXeoEVlYKoxVo2WPA | Mixteca Mexican Food | 4 | It's hard to go wrong at Mixteca. At first gl... | Mexican, Restaurants | False |
| 2740998 | QDRFdG8gPPKL7r4yic8j7Q | The Original Burrito Company | 4 | So I've gota bad habit of ordering the same th... | Mexican, Restaurants | False |
| 2603889 | rbcfYmJtqwIkk17IeOI5Kw | BARDOT Brasserie | 5 | The best brunch in Las Vegas. \n\nThat sounds... | Cafes, Hostels, Brasseries, Hotels & Travel, R... | False |
| 1023415 | J2Am_nJkdicGk2S1DzwuPA | Carl's Jr | 1 | The worst carls jr. I have been to in the Paci... | Fast Food, Restaurants | False |
| 1938088 | kRgAf6j2y1eR0wOFdzFAuw | Firefly | 4 | Great bar with huge liquor selection. He tapas... | Tapas/Small Plates, Restaurants, Tapas Bars, S... | False |
| 1231561 | vWFhRvHVIJAzIeOX4g_YcA | Original Tommy's | 1 | All I have to say is I see why this place neve... | Restaurants, Burgers | False |
| 3362107 | BWWzh28StP6hkMm5L4nCAQ | Pita Land | 2 | This fast food restaurant is strangely under-s... | Middle Eastern, Sandwiches, Halal, Fast Food, ... | False |
| 727051 | j-5O-Ehd2eaCHYgmTSfoRw | Chipotle Mexican Grill | 2 | When I was halfway through the long line at Ch... | Mexican, Restaurants, Fast Food | False |
| 2312132 | EWpRPVSiPxWbdJBgNDNvGw | Piada Italian Street Food | 5 | My husband and I went to the soft opening earl... | Restaurants, Salad, Wraps, Italian | False |
| 2859049 | -_TSaVr53qiEGqMkwyEMaQ | Parsley Modern Mediterranean | 4 | Decent food, flavorful and filling, especially... | Sandwiches, Mediterranean, Restaurants, Middle... | False |
| 97004 | zidkKI_N1OPxsiddTOQH_Q | Naked BBQ | 5 | Best barbeque brisket I have ever eaten. The o... | American (New), Caterers, Southern, Restaurant... | False |
| 852068 | q3dJQtwZQrrurNT-1bNKgQ | Capo's Italian Cuisine | 5 | Had a great night and Armondo served us well! ... | Restaurants, Italian | False |
| 916914 | 7iuruLs-q-_RW06IbsBxZw | Arirang Korean BBQ | 5 | I love this place. Every time I come I get the... | Korean, Barbeque, Restaurants | False |
| 2932256 | S599hCA4kJJO3_b6SRFKoA | Michoacan | 1 | This is the worst place I have ever been to! I... | Seafood, Restaurants, Mexican, Breakfast & Brunch | False |
| 2475995 | 4JNXUYY8wbaaDmk3BPzlWw | Mon Ami Gabi | 4 | Visited early August 2016. \nWe came for break... | Steakhouses, Breakfast & Brunch, Restaurants, ... | False |
| 2970425 | PeATTp15Y_ExaN6mR1dmKw | Quaker Steak & Lube | 4 | Chicken was awesome! The Smokey Gold barbecue ... | Steakhouses, Restaurants, Chicken Wings, Ameri... | False |
| 1176580 | CPgz4srKkE5u9aoBAOaQsA | Davidson Pizza Company | 4 | Good selection of specialty pizzas. $18 range ... | Chicken Wings, Salad, Restaurants, Pizza | False |
| 2786561 | Jy40ercZIQaNcz2qV3qgow | Mi Amigo's Mexican Grill | 5 | Amazing prices, amazing food, amazing margarit... | Restaurants, Mexican | False |
| 1147706 | ysJo5Jdo29XOBCnKrbUKWg | Paris 66 | 4 | Fine French food is rare in Pittsburgh. Fairl... | Breakfast & Brunch, Creperies, Gluten-Free, Br... | False |
| 2061131 | sGZg_8t5I5WopNwsW7riAQ | Wild Ocean Seafood Market & Grille | 5 | Came for the fresh, wild caught fish and fell ... | Restaurants, Seafood Markets, Food, Meat Shops... | False |
| 2429937 | Vs7gc9EE3k9wARuUcN9piA | Pan Asian | 5 | Pan Asian stays 5 stars for me. I went back t... | Thai, Chinese, Japanese, Restaurants | False |
| 640067 | mYzlPKXvOVRrQivHnDqD5g | YamChops | 5 | Amazingly delicious with a variety of distinct... | Butcher, Juice Bars & Smoothies, Vegan, Restau... | False |
| 1566257 | 9Zl4uWSgSMpxHnsK_MPneg | Palermo Family Restaurant | 4 | Eat here quite often, the Palermo Special piz... | Restaurants, Pizza | False |
| 1607254 | F-AYOq1xIY2u_qmWUG5VBw | Hakka Ren | 5 | It was amazing! I had the chilli fish schezuan... | Chinese, Halal, Indian, Restaurants | False |
| 1076163 | yNBVOKSZN_AIjSJdhF_rqA | Garbanzo Mediterranean Grill | 4 | My salad was really good. Nice atmosphere, fri... | Falafel, Event Planning & Services, American (... | False |
20000 rows × 6 columns
print(combined.shape)
(20000, 6)
(c) Split the dataset into a training and test set¶
# Split into training and test data sets
training_data, test_data = modsel.train_test_split(combined,
train_size=0.7,
random_state=123)
training_data.shape
(14000, 6)
test_data.shape
(6000, 6)
(d) Transform the text as BoW (bag-of-words)¶
# Represent the review text as a bag-of-words
bow_transform = text.CountVectorizer()
X_tr_bow = bow_transform.fit_transform(training_data['text'])
len(bow_transform.vocabulary_)
29944
X_tr_bow.shape
(14000, 29944)
X_te_bow = bow_transform.transform(test_data['text'])
y_tr = training_data['target']
y_te = test_data['target']
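To make the bag-of-words encoding concrete, here is a toy example on two hand-made sentences (the sentences are our own, not Yelp data). Each document becomes one row of word counts over the shared vocabulary:

```python
from sklearn.feature_extraction import text

docs = ["great food and great service", "slow service"]
vec = text.CountVectorizer()
bow = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))   # the learned vocabulary, alphabetical
print(bow.toarray())             # one row of word counts per document
```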
(e,f) Classify with logistic regression¶
def simple_logistic_classify(X_tr, y_tr, X_test, y_test, description, _C=1.0):
    ## Helper function to train a logistic classifier and score on test data
    m = LogisticRegression(C=_C, solver='newton-cg', max_iter=500).fit(X_tr, y_tr)
    s = m.score(X_test, y_test)
    print('Test score with', description, 'features:', s)
    return m
m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow')
Test score with bow features: 0.7145
(f) Applying normalization to the features¶
# l2-normalize each feature column (axis=0); axis=1 would normalize each review instead
X_tr_l2 = preproc.normalize(X_tr_bow, axis=0)
X_te_l2 = preproc.normalize(X_te_bow, axis=0)
m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized')
Test score with l2-normalized features: 0.739
(g) tf-idf representation¶
# Create the tf-idf representation using the bag-of-words matrix
tfidf_trfm = text.TfidfTransformer(norm=None)
X_tr_tfidf = tfidf_trfm.fit_transform(X_tr_bow)
X_te_tfidf = tfidf_trfm.transform(X_te_bow)
m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf')
Test score with tf-idf features: 0.6798333333333333
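What `TfidfTransformer` does to the counts can be checked on a toy pair of documents (again our own toy data, not the Yelp reviews): a word that occurs in every document receives the smallest inverse-document-frequency weight.

```python
from sklearn.feature_extraction import text

docs = ["good food", "good service"]
bow = text.CountVectorizer().fit_transform(docs)   # columns: food, good, service
tfidf = text.TfidfTransformer(norm=None).fit(bow)

# 'good' appears in both documents, so its idf weight is the smallest
print(tfidf.idf_)
```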
(h) Tune regularization parameters using grid search¶
param_grid_ = {'C': [1e-5, 1e-3, 1e-1, 1e0, 1e1, 1e2]}
bow_search = modsel.GridSearchCV(LogisticRegression(solver='lbfgs',max_iter=1000),
cv=5, param_grid=param_grid_, return_train_score=True)
l2_search = modsel.GridSearchCV(LogisticRegression(solver='lbfgs',max_iter=500),
cv=5, return_train_score=True, param_grid=param_grid_)
tfidf_search = modsel.GridSearchCV(LogisticRegression(solver='lbfgs',max_iter=500),
cv=5, return_train_score=True, param_grid=param_grid_)
bow_search.fit(X_tr_bow, y_tr)
C:\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
GridSearchCV(cv=5, error_score=nan,
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True,
intercept_scaling=1, l1_ratio=None,
max_iter=1000, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs',
tol=0.0001, verbose=0,
warm_start=False),
iid='deprecated', n_jobs=None,
param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring=None, verbose=0)
bow_search.best_score_
0.7190000000000001
l2_search.fit(X_tr_l2, y_tr)
GridSearchCV(cv=5, error_score=nan,
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True,
intercept_scaling=1, l1_ratio=None,
max_iter=500, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs',
tol=0.0001, verbose=0,
warm_start=False),
iid='deprecated', n_jobs=None,
param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring=None, verbose=0)
l2_search.best_score_
0.7276428571428571
tfidf_search.fit(X_tr_tfidf, y_tr)
C:\Users\wurc\.conda\envs\ML\lib\site-packages\sklearn\linear_model\logistic.py:947: ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations. "of iterations.", ConvergenceWarning)
GridSearchCV(cv=5, error_score='raise-deprecating',
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True,
intercept_scaling=1, l1_ratio=None,
max_iter=500, multi_class='warn',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs',
tol=0.0001, verbose=0,
warm_start=False),
iid='warn', n_jobs=None,
param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring=None, verbose=0)
tfidf_search.best_score_
0.7340714285714286
What regularization parameters are best for each method?
bow_search.best_params_
{'C': 0.1}
l2_search.best_params_
{'C': 1.0}
tfidf_search.best_params_
{'C': 0.001}
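Alongside `best_params_`, a fitted `GridSearchCV` also exposes the refit model as `best_estimator_`. A self-contained sketch on synthetic data (the toy data and grid below are our own):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import sklearn.model_selection as modsel

X, y = make_classification(n_samples=300, random_state=0)
search = modsel.GridSearchCV(LogisticRegression(max_iter=1000),
                             param_grid={'C': [0.01, 1.0, 100.0]}, cv=5)
search.fit(X, y)

# best_estimator_ is already refit on the full training data
print(search.best_params_)
print(round(search.best_estimator_.score(X, y), 3))
```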
Let's check one of the grid search outputs to see how it went:
bow_search.cv_results_
{'mean_fit_time': array([0.14379988, 0.21599741, 1.66459932, 4.06620111, 7.41766043,
7.58538895]),
'std_fit_time': array([0.01511577, 0.02066551, 0.09023791, 0.2676194 , 0.31160726,
0.7963182 ]),
'mean_score_time': array([0.002598 , 0.00240026, 0.00260324, 0.0028017 , 0.00260191,
0.00320039]),
'std_score_time': array([0.00049296, 0.00049049, 0.00049121, 0.00074931, 0.00049081,
0.00147161]),
'param_C': masked_array(data=[1e-05, 0.001, 0.1, 1.0, 10.0, 100.0],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object),
'params': [{'C': 1e-05},
{'C': 0.001},
{'C': 0.1},
{'C': 1.0},
{'C': 10.0},
{'C': 100.0}],
'split0_test_score': array([0.56535714, 0.70464286, 0.71285714, 0.69785714, 0.67857143,
0.66785714]),
'split1_test_score': array([0.55 , 0.71178571, 0.71964286, 0.7075 , 0.68178571,
0.66821429]),
'split2_test_score': array([0.57285714, 0.70892857, 0.71642857, 0.70642857, 0.68321429,
0.66892857]),
'split3_test_score': array([0.56857143, 0.72035714, 0.725 , 0.69821429, 0.67964286,
0.67035714]),
'split4_test_score': array([0.54535714, 0.71392857, 0.72107143, 0.70428571, 0.67857143,
0.66321429]),
'mean_test_score': array([0.56042857, 0.71192857, 0.719 , 0.70285714, 0.68035714,
0.66771429]),
'std_test_score': array([0.01077933, 0.00523723, 0.00412434, 0.00407206, 0.00184888,
0.00240747]),
'rank_test_score': array([6, 2, 1, 3, 4, 5]),
'split0_train_score': array([0.56491071, 0.73678571, 0.86758929, 0.95883929, 0.99491071,
1. ]),
'split1_train_score': array([0.55928571, 0.73589286, 0.86803571, 0.96035714, 0.995 ,
1. ]),
'split2_train_score': array([0.56482143, 0.73785714, 0.86598214, 0.95964286, 0.99553571,
1. ]),
'split3_train_score': array([0.56848214, 0.73357143, 0.86669643, 0.95982143, 0.995625 ,
1. ]),
'split4_train_score': array([0.56482143, 0.73428571, 0.86625 , 0.95785714, 0.99464286,
1. ]),
'mean_train_score': array([0.56446429, 0.73567857, 0.86691071, 0.95930357, 0.99514286,
1. ]),
'std_train_score': array([0.0029467 , 0.00157467, 0.00078368, 0.0008719 , 0.00037712,
0. ])}
import pickle
results_file = open('tfidf_gridcv_results.pkl', 'wb')
pickle.dump(bow_search, results_file, -1)
pickle.dump(tfidf_search, results_file, -1)
pickle.dump(l2_search, results_file, -1)
results_file.close()
pkl_file = open('tfidf_gridcv_results.pkl', 'rb')
bow_search = pickle.load(pkl_file)
tfidf_search = pickle.load(pkl_file)
l2_search = pickle.load(pkl_file)
pkl_file.close()
search_results = pd.DataFrame.from_dict({'bow': bow_search.cv_results_['mean_test_score'],
'tfidf': tfidf_search.cv_results_['mean_test_score'],
'l2': l2_search.cv_results_['mean_test_score']})
search_results
| bow | tfidf | l2 | |
|---|---|---|---|
| 0 | 0.560429 | 0.701643 | 0.502500 |
| 1 | 0.711929 | 0.734071 | 0.502500 |
| 2 | 0.719000 | 0.692357 | 0.710357 |
| 3 | 0.702857 | 0.673286 | 0.727643 |
| 4 | 0.680357 | 0.665714 | 0.716286 |
| 5 | 0.667714 | 0.660714 | 0.694786 |
Plot cross-validation results¶
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
ax = sns.boxplot(data=search_results, width=0.4)
ax.set_ylabel('Accuracy', size=14)
ax.tick_params(labelsize=14)
plt.savefig('tfidf_gridcv_results.png')
m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow',
_C=bow_search.best_params_['C'])
m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized',
_C=l2_search.best_params_['C'])
m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf',
_C=tfidf_search.best_params_['C'])
Test score with bow features: 0.7293333333333333
Test score with l2-normalized features: 0.739
Test score with tf-idf features: 0.7413333333333333
bow_search.cv_results_['mean_test_score']
array([0.56042857, 0.71192857, 0.719 , 0.70285714, 0.68035714,
0.66771429])
Lab07: A4 Free Spoken Digits Classification (FSDD dataset)¶
In this notebook, we introduce a possible approach to the Free Spoken Digit Dataset classification problem.
The machine learning models are able to predict audio labels with an accuracy of 98%.
This notebook was inspired by: Inam ur Rehman, Data Analyst at PIC - Servizi per l'informatica, Turin, Piedmont, Italy
In particular¶
we shall see:
- How to play audio files in Python
- How to sample audio signal into digital form
- How to remove leading and trailing noise (e.g. silence) from audio
- How to set a naive baseline (use of time domain)
- How to combine time and frequency domain features (spectrogram)
- How to train, build and predict using different machine learning models
Import Necessary Libraries¶
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from os import listdir
from os.path import join
from scipy.io import wavfile
import IPython.display as ipd
from librosa.feature import melspectrogram
from librosa import power_to_db
from librosa.effects import trim
# plotting utilities
%matplotlib inline
plt.rcParams["figure.figsize"] = (8, 4)
plt.rcParams["figure.titleweight"] = 'bold'
plt.rcParams["figure.titlesize"] = 'large'
plt.rcParams['figure.dpi'] = 120
#plt.style.use('fivethirtyeight')
rs = 99
Load data¶
- The Free Spoken Digit Dataset is a collection of audio recordings of utterances of digits (“zero” to “nine”) from different people.
- You can download the dataset here: https://www.kaggle.com/datasets/joserzapata/free-spoken-digit-dataset-fsdd
- Download the data to your local drive and set the directory path accordingly.
The goal of this competition is to correctly identify the digit being uttered in each recording.
files = 'E:/temp/ML_datasets/recordings'
ds_files = listdir(files)
X = []
y = []
for file in ds_files:
    label = int(file.split("_")[0])
    rate, data = wavfile.read(join(files, file))
    X.append(data.astype(np.float16))
    y.append(label)
len(X), len(y)
(3000, 3000)
np.unique(y, return_counts = True)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([300, 300, 300, 300, 300, 300, 300, 300, 300, 300], dtype=int64))
The problem is well balanced: for each of the classes we have 300 samples in the dataset.
All recordings are sampled at a rate of 8 kHz.
Audio signals have different lengths.
Some of them have leading and trailing silence intervals. Let's analyze that first.
(b) Plot a histogram of the length of the spoken words.¶
- What is the average length of the spoken words?
- What is the standard deviation of the length of the words?
- What is the 90th percentile of the length of the recordings?
- How many outliers exceed the 90th percentile?
rate = 8000
def show_length_distribution(signals, rate = 8000):
    sample_times = [len(x)/rate for x in signals]
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.20, .80)})
    # Add a graph in each part
    sns.boxplot(x = sample_times, ax=ax_box, linewidth = 0.9, color= '#9af772')
    sns.histplot(x = sample_times, ax=ax_hist, bins = 'fd', kde = True)
    # Remove x axis name for the boxplot
    ax_box.set(xlabel='')
    title = 'Audio signal lengths'
    x_label = 'duration (seconds)'
    y_label = 'count'
    plt.suptitle(title)
    ax_hist.set_xlabel(x_label)
    ax_hist.set_ylabel(y_label)
    plt.show()
    return sample_times
lengths = show_length_distribution(X)
np.mean(lengths)
0.4374343333333333
np.std(lengths)
0.14761839633964627
q = 90
np.percentile(lengths, q)
0.604525
tot_outliers = sum(map(lambda x: x > np.percentile(lengths, q), lengths))
print(f'Values outside {q} percentile: {tot_outliers}')
Values outside 90 percentile: 300
These outliers will be handled later according to the proposed solutions.
(c) Play and display one of the recordings.¶
Have a look at some extreme cases. Use IPython.display (imported as ipd) to display and play the audio signals.
- Look at the longest signal
- Look at the shortest signal
Longest_audio = np.argmax([len(x) for x in X])
plt.plot(X[Longest_audio])
plt.title("Longest audio signal");
plt.grid(True)
plt.xlabel('sample')
plt.ylabel('amplitude (arb.)')
ipd.Audio(X[Longest_audio], rate=rate)
Shortest_audio = np.argmin([len(x) for x in X])
plt.plot(X[Shortest_audio])
plt.title("Shortest audio signal");
plt.grid(True)
plt.xlabel('sample')
plt.ylabel('amplitude (arb.)')
ipd.Audio(X[Shortest_audio], rate=rate)
# anything more than top_db (here 10 dB) below the peak is treated as silence
def remove_silence(sample, sr= 8000, top_db = 10):
"""This function removes trailing and leading silence periods of audio signals.
"""
y = np.array(sample, dtype = np.float64)
# Trim the beginning and ending silence
yt, _ = trim(y, top_db= top_db)
return yt
X_tr = [remove_silence(x) for x in X]
show_length_distribution(X_tr);
We can explore different recordings to see how they are trimmed.
plt.plot(X_tr[Longest_audio])
plt.title("Longest recording after trimming");
plt.grid(True); plt.xlabel('sample'); plt.ylabel('amplitude (arb.)')
ipd.Audio(X_tr[Longest_audio], rate=rate)
(e) Create a matrix with rows of uniform length to align all recordings.¶
- All signals should have rate*0.8 = 6400 data points.
N = int(rate * 0.8)  # 0.8 s is the upper limit of the trimmed audio length
print(N)
6400
X_uniform = []
for x in X_tr:
if len(x) < N:
X_uniform.append(np.pad(x, (0, N - len(x)), constant_values = (0, 0)))
else:
X_uniform.append(x[:N])
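A quick self-contained sanity check of the pad-or-truncate logic above, using hypothetical stand-in arrays instead of `X_tr` (the helper `pad_or_truncate` is just the loop body wrapped in a function):

```python
import numpy as np

N = 6400  # target length: rate * 0.8

def pad_or_truncate(x, N):
    """Zero-pad short signals and truncate long ones to exactly N samples."""
    if len(x) < N:
        return np.pad(x, (0, N - len(x)), constant_values=(0, 0))
    return x[:N]

demo = [np.ones(5000), np.ones(8000)]  # stand-ins for trimmed signals
uniform = np.stack([pad_or_truncate(x, N) for x in demo])
print(uniform.shape)   # (2, 6400)
print(uniform[0, -1])  # 0.0 - the short signal was zero-padded
```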
(f) Audio feature generation¶
- Write a function that creates bins of equal width and computes the mean and standard deviation of those bins as features for audio classification.
def into_bins(X, bins = 20):
    """This function creates bins of equal width and computes the mean and
    standard deviation of each bin.
    """
    X_mean_sd = []
    for x in X:
        x_mean_sd = []
        As = np.array_split(np.array(x), bins)  # bug fix: use `bins`, not a hard-coded 20
        for a in As:
            mean = np.round(a.mean(dtype=np.float64), 4)
            sd = np.round(a.std(dtype=np.float64), 4)
            x_mean_sd.extend([mean, sd])
        X_mean_sd.append(x_mean_sd)
    return np.array(X_mean_sd)
(g) Random Forest classifier on the time domain features¶
- Train a random forest classifier on the dataset consisting of the mean and the standard deviation of the bins.
- Hypertune the random forest classifier using a grid search over the following parameters:
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
The number of bins is a hyperparameter.
We will try different numbers of bins with the default configuration of the Random Forest classifier.
for bins in range(20,101,20):
X_mean_sd = into_bins(X_uniform, bins)
X_train, X_test, y_train, y_test = train_test_split(X_mean_sd, y, test_size = 0.20, random_state = rs)
clf = RFC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
p,r,f,s = precision_recall_fscore_support(y_test, y_pred)
print(f"for {bins} bins, f-macro average:{f.mean()}, accuracy: {acc}")
for 20 bins, f-macro average:0.5575460723648693, accuracy: 0.565
for 40 bins, f-macro average:0.5899310247582592, accuracy: 0.595
for 60 bins, f-macro average:0.5678999146226718, accuracy: 0.5733333333333334
for 80 bins, f-macro average:0.5559281871127035, accuracy: 0.565
for 100 bins, f-macro average:0.5641751372289103, accuracy: 0.5716666666666667
The scores are comparable across the different bin counts, so we proceed with 60 bins.
Next, we tune the model configuration via grid search.
Hyperparameter tuning¶
X_time = into_bins(X_uniform, 60)
X_train, X_test, y_train, y_test = train_test_split(X_time, y, test_size = 0.20, random_state = rs)
Random Forest Classifier¶
param_grid = {
"n_estimators": [100,150,200],
"criterion": ["gini", "entropy"],
"min_impurity_decrease": [0.0,0.05,0.1]
}
clf = RFC(random_state = rs, n_jobs = -1 )
grid_search = GridSearchCV(clf, param_grid, scoring = "f1_macro", cv = 5)
grid_search.fit(X_train, y_train)
print("best Parameters for RF model:\n", grid_search.best_params_)
print("best score:", grid_search.best_score_)
print("\n\n Results on test dataset:\n\n")
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
best Parameters for RF model:
{'criterion': 'gini', 'min_impurity_decrease': 0.0, 'n_estimators': 150}
best score: 0.5922881417565689
Results on test dataset:
precision recall f1-score support
0 0.65 0.73 0.69 62
1 0.49 0.48 0.49 58
2 0.50 0.52 0.51 56
3 0.38 0.24 0.29 62
4 0.44 0.44 0.44 63
5 0.47 0.45 0.46 62
6 0.79 0.86 0.82 56
7 0.65 0.64 0.64 58
8 0.70 0.74 0.72 62
9 0.51 0.59 0.55 61
accuracy 0.57 600
macro avg 0.56 0.57 0.56 600
weighted avg 0.56 0.57 0.56 600
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix using Seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix')
plt.show()
Spectral Features¶
In a spectral representation of an audio signal, time is on the x-axis and frequency on the y-axis. The values in the matrix represent a property of the audio signal at a particular time and frequency (amplitude, power, etc.).
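To make the time-frequency matrix concrete, here is a minimal sketch (using `scipy.signal.spectrogram` on a synthetic 440 Hz tone, not one of the recordings): rows are frequency bins, columns are time frames, and the dominant row should sit near 440 Hz.

```python
import numpy as np
from scipy import signal

rate = 8000
t = np.arange(0, 0.5, 1 / rate)
x = np.sin(2 * np.pi * 440 * t)  # synthetic test tone

freqs, times, Sxx = signal.spectrogram(x, fs=rate)
print(Sxx.shape)  # (frequency bins, time frames)
dominant = freqs[np.argmax(Sxx.sum(axis=1))]
print(dominant)   # close to 440 Hz
```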
(h) Plot a power spectrogram of an arbitrary sound sample spectrogram on log scale (dB)¶
- Plot a power spectrogram of an arbitrary sound sample spectrogram on log scale (dB)
- use powerSpectrum, frequenciesFound, time, imageAxis = plt.specgram(X[...], Fs=rate, scale = "dB")
# Plot the spectrogram of power on log scale
# fig, ax = plt.subplots(figsize = (8,6))
powerSpectrum, frequenciesFound, time, imageAxis = plt.specgram(X[np.random.randint(100)], Fs=rate, scale = "dB")
cbar = plt.gcf().colorbar(imageAxis)
cbar.set_label('dB')
plt.grid()
plt.suptitle("Spectrogram of a signal")
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.show()
(i) Feature extraction from the MEL spectrogram¶
We have seen that both time and frequency domains contain useful information regarding the recordings.
We can leverage both by using the spectrogram of each signal.
To extract features from the MEL spectrogram of a given signal, we divide it into N x N sub-matrices of nearly identical shape.
- Compute the mean and standard deviation of these sub-matrices and use them as the feature set.
- The number N of sub-matrices is a hyperparameter for the classifier.
- You can use the helper function ft_mean_std(X, n, f_s = 8000) to do this.
def ft_mean_std(X, n, f_s = 8000):
    """Computes the mean and std of each of the n x n blocks of the spectrograms of X.
    Empty blocks (when the number of columns < n) fall back to the median and std
    of the whole row split.
    Parameters:
        X: list of 1-d sample arrays
        n: number of rows and columns to split the spectrogram into
    Returns:
        A 2-d numpy array - feature matrix with 2 * n * n features as columns
    """
    X_sp = []  # feature matrix
    for x in X:
        sp = power_to_db(melspectrogram(y=x, n_fft=len(x)))
        x_sp = []  # feature set of the current sample
        # split the rows
        for v_split in np.array_split(sp, n, axis=0):
            # split the columns
            for h_split in np.array_split(v_split, n, axis=1):
                if h_split.size == 0:  # happens when the number of columns < n
                    m = np.round(np.median(v_split), 4)
                    sd = np.round(np.std(v_split), 4)
                else:
                    m = np.round(np.mean(h_split), 4)
                    sd = np.round(np.std(h_split), 4)
                x_sp.extend([m, sd])
        X_sp.append(x_sp)
    return np.array(X_sp)
X_ft = ft_mean_std(X, 10)
len(X_ft)
3000
Hyperparameter tuning¶
(j) Determine the optimum number N of bins¶
- Determine the optimum number N of bins in the time-frequency domain (N along the time dimension and N along the frequency dimension).
- Program a for loop that varies N in range(3,20,2), trains a Random Forest Classifier on the training set (80%) and validates it on the validation set (20%).
- Select the optimum number of bins N.
models = {"rfc": RFC(random_state=rs, n_jobs=-1)}
scores = {}
for n in range(3,20,2):
X_ft = ft_mean_std(X, n)
X_train, X_test, y_train, y_test = train_test_split(X_ft, y, test_size = 0.20, random_state = rs)
score = []
for model in models:
clf = models[model]
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
p,r,f,s = precision_recall_fscore_support(y_test, y_pred)
score.append((model, np.mean(f)))
scores[n] = score
rf_scores = [x[0][1] for x in scores.values()]
x = scores.keys()
plt.plot(x, rf_scores, 'o-', label = 'RF')
plt.grid(True)
plt.legend(loc = (1,.8))
plt.suptitle("Model evaluation on different n. of bins")
plt.xlabel("n. of bins")
plt.ylabel('mean f-score')
plt.show()
We select 10 as the initial number of bins; the model is stable in the neighborhood of 10.
We can now check its performance with an optimized configuration.
X_ft = ft_mean_std(X, 10)
X_train, X_test, y_train, y_test = train_test_split(X_ft, y, test_size = 0.20, random_state = rs)
Classification models¶
(k) Hypertune a Random Forest Classifier¶
- Hypertune a Random Forest Classifier using
N=10bins using a 5-fold crossvalidation with the following grid search:
param_grid = {
"n_estimators": [100,150,200],
"criterion": ["gini", "entropy"],
"min_impurity_decrease": [0.0,0.05,0.1]
}
clf = RFC(random_state = rs, n_jobs = -1 )
rf_search = GridSearchCV(clf, param_grid, scoring = "f1_macro", cv = 5)
rf_search.fit(X_train, y_train)
print("best Parameters for RF model:\n", rf_search.best_params_)
print("best score:", rf_search.best_score_)
print("\n\n Results on test dataset:\n\n")
y_pred = rf_search.predict(X_test)
print(classification_report(y_test, y_pred))
best Parameters for RF model:
{'criterion': 'entropy', 'min_impurity_decrease': 0.0, 'n_estimators': 200}
best score: 0.9485476561620784
Results on test dataset:
precision recall f1-score support
0 0.98 1.00 0.99 62
1 0.95 0.97 0.96 58
2 0.98 0.96 0.97 56
3 0.97 0.98 0.98 62
4 0.95 0.98 0.97 63
5 0.98 0.95 0.97 62
6 0.91 0.95 0.93 56
7 0.98 1.00 0.99 58
8 1.00 0.92 0.96 62
9 0.98 0.98 0.98 61
accuracy 0.97 600
macro avg 0.97 0.97 0.97 600
weighted avg 0.97 0.97 0.97 600
rfc = RFC(n_estimators=200, criterion='entropy', min_impurity_decrease=0.0, random_state=rs, n_jobs=-1)  # best parameters from the grid search above
scores = cross_val_score(rfc, X_ft, y, cv=10, scoring = 'accuracy', n_jobs = -1)
report = f"""Average accuracy of Random Forest model: {np.mean(scores):.2f}
with a standard deviation of {np.std(scores):.2f}
"""
print(report)
Average accuracy of Random Forest model: 0.92 with a standard deviation of 0.04
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix using Seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix')
plt.show()
Appendix (optional)¶
Although the results are quite satisfactory, we can use other techniques to split the spectrogram matrix.
Another way of splitting the spectrogram is to pad it such that each sub-matrix has an identical shape.
This also lets us avoid the nested for-loops, which are a performance bottleneck.
- This time, we use a Support Vector Classifier (see lecture 8).
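To see why replacing the inner loops pays off, a small self-contained comparison (NumPy only, a random matrix standing in for a spectrogram) showing that a single reshape reproduces the loop-based block means:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(120, 640))  # stand-in for a spectrogram, divisible by n
n = 10
sub_r, sub_c = A.shape[0] // n, A.shape[1] // n

# vectorized: one reshape, then mean over the two block axes
means_vec = A.reshape(n, sub_r, n, sub_c).mean(axis=(1, 3))

# loop-based reference, analogous to ft_mean_std above
means_loop = np.array([[blk.mean() for blk in np.array_split(row, n, axis=1)]
                       for row in np.array_split(A, n, axis=0)])

print(np.allclose(means_vec, means_loop))  # True
```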
def split(array, w_bins):
    """Split a matrix into w_bins x w_bins sub-matrices of equal size."""
    # original dimensions
    rows, cols = array.shape
    # size of the sub-matrices (ceiling division)
    sub_rows = -(-rows // w_bins)
    sub_cols = -(-cols // w_bins)
    # zero-padding so the blocks fit exactly
    pad_rows = sub_rows * w_bins - rows
    pad_cols = sub_cols * w_bins - cols
    padded_array = np.pad(array, ((0, pad_rows), (0, pad_cols)))
    rows, cols = padded_array.shape
    return (padded_array.reshape(rows // sub_rows, sub_rows, -1, sub_cols)
            .swapaxes(1, 2)
            .reshape(-1, sub_rows, sub_cols))
def split_ft_mean_std(X, n):
    """Computes the mean and std of each of the n x n blocks of the spectrograms of X.
    Spectrograms are padded with zeros so they divide evenly into n x n sub-matrices.
    Parameters:
        X: list of 1-d sample arrays
        n: number of rows and columns to split the spectrogram into
    Returns:
        A 2-d numpy array - feature matrix with 2 * n * n features
    """
    X_sp = []  # feature matrix
    for x in X:
        sp = power_to_db(melspectrogram(y=x, n_fft=len(x)))
        blocks = split(sp, n)
        mean = blocks.mean(axis=(-1, -2))
        std = blocks.std(axis=(-1, -2))
        X_sp.append(np.hstack((mean, std)))
    return np.array(X_sp)
# %timeit -n2 -r1 ft_mean_std(X, 10)
%timeit -n2 -r1 split_ft_mean_std(X, 10)
19.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)
Timing the loop-based version separately (the commented line above) shows this new method to be roughly twice as fast.
Let's compare the results:
steps = [('scaler', StandardScaler()), ('SVM', svm.SVC())]
pipeline = Pipeline(steps)
parameteres = {'SVM__C':[5,10,20], 'SVM__kernel':["linear", "poly", "rbf"]}
X_ft = split_ft_mean_std(X, 10)
X_train, X_test, y_train, y_test = train_test_split(X_ft, y, test_size = 0.20, random_state = rs)
svm_search = GridSearchCV(pipeline, param_grid=parameteres, cv=5)
svm_search.fit(X_train, y_train)
print("best Parameters for SVM model:\n", svm_search.best_params_)
print("best score:", svm_search.best_score_)
print("\n\n Results on test dataset:\n\n")
y_pred = svm_search.predict(X_test)
print(classification_report(y_test, y_pred))
best Parameters for SVM model:
{'SVM__C': 20, 'SVM__kernel': 'rbf'}
best score: 0.8891666666666668
Results on test dataset:
precision recall f1-score support
0 0.86 0.89 0.87 62
1 0.98 0.86 0.92 58
2 0.89 0.89 0.89 56
3 0.88 0.84 0.86 62
4 0.94 0.97 0.95 63
5 0.89 0.92 0.90 62
6 0.83 0.89 0.86 56
7 0.87 0.95 0.91 58
8 0.96 0.89 0.92 62
9 0.89 0.89 0.89 61
accuracy 0.90 600
macro avg 0.90 0.90 0.90 600
weighted avg 0.90 0.90 0.90 600
steps = [('scaler', StandardScaler()), ('SVM', svm.SVC(C= 20, kernel= 'rbf'))]
pipeline = Pipeline(steps)
scores = cross_val_score(pipeline, X_ft, y, cv=10, scoring = 'accuracy', n_jobs = -1)
report = f"""Average accuracy of SVM model: {np.mean(scores):.2f}
with a standard deviation of {np.std(scores):.2f}
"""
print(report)
Average accuracy of SVM model: 0.85 with a standard deviation of 0.05
We get comparable results, but the model is more efficient now.
The results can be improved further by tuning the number of splits (as we did earlier).
Conclusions¶
The proposed approach clearly outperforms the naive baseline we defined at the beginning.
It does so by leveraging both time- and frequency-based features.
We have empirically shown that the selected classifiers perform similarly on this specific task, achieving satisfactory results in terms of macro f1 score and accuracy.
We could further improve the results with a different set of hyperparameters. The results obtained, however, are already very promising; this classification problem is indeed quite easy and the available dataset is small.
ML08 A2 Simple SVM examples¶
MSE_FTP_MachLe, WÜRC
Execute the following examples to get a feeling for the SVM and how to use it with scikit-learn. We start with a random two-dimensional data set.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
Creation of data¶
We create 2 times 100 data points as follows:
%matplotlib inline
np.random.seed(42)
y = np.concatenate((np.repeat(-1,100),np.repeat(1,100)))
X = np.random.rand(200,2)
X[:,0] += 0.3*y
np.shape(y)
plt.scatter(X[:,0],X[:,1],c=y, cmap=plt.cm.Paired)
plt.show()
Training and prediction using a SVM¶
Execute the following code and adapt it to make predictions for a few other points. Does the result make sense?
C = 0.4
svc = svm.SVC(kernel='linear', C=C).fit(X,y)
svc.predict([(-0.4,1)])
array([-1])
Cross validation¶
Play around with the code below. Which parameter C gives the best leave-one-out cross validation error?
from sklearn import model_selection
C = 0.03
svc = svm.SVC(kernel='linear', C=C)
loo = model_selection.LeaveOneOut()
# svc.fit(...).score() gives 1 if prediction is correct 0 otherwise
res = [svc.fit(X[train], y[train]).score(X[test], y[test]) for train, test in loo.split(X)]
#res is a vector with 0,1
np.mean(res) #The average accuracy
0.78
from sklearn import model_selection
Clist = np.logspace(-3,3,14)
for C in Clist:
svc = svm.SVC(kernel='linear', C=C)
loo = model_selection.LeaveOneOut()
# svc.fit(...).score() gives 1 if prediction is correct 0 otherwise
res = [svc.fit(X[train], y[train]).score(X[test], y[test]) for train, test in loo.split(X)]
#res is a vector with 0,1
print('C: %f \t accuracy: %f' % (C,np.mean(res))) #The average accuracy
C: 0.001000 	 accuracy: 0.000000
C: 0.002894 	 accuracy: 0.000000
C: 0.008377 	 accuracy: 0.000000
C: 0.024245 	 accuracy: 0.780000
C: 0.070170 	 accuracy: 0.780000
C: 0.203092 	 accuracy: 0.780000
C: 0.587802 	 accuracy: 0.765000
C: 1.701254 	 accuracy: 0.790000
C: 4.923883 	 accuracy: 0.785000
C: 14.251027 	 accuracy: 0.785000
C: 41.246264 	 accuracy: 0.785000
C: 119.377664 	 accuracy: 0.785000
C: 345.510729 	 accuracy: 0.785000
C: 1000.000000 	 accuracy: 0.785000
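The question above can be answered directly from the table. A small sketch picking the best C programmatically (the values are copied from the printed output above):

```python
# (C, LOO accuracy) pairs copied from the output above
loo_results = {
    0.001000: 0.000, 0.002894: 0.000, 0.008377: 0.000,
    0.024245: 0.780, 0.070170: 0.780, 0.203092: 0.780,
    0.587802: 0.765, 1.701254: 0.790, 4.923883: 0.785,
    14.251027: 0.785, 41.246264: 0.785, 119.377664: 0.785,
    345.510729: 0.785, 1000.000000: 0.785,
}
best_C = max(loo_results, key=loo_results.get)
print(best_C)  # 1.701254 gives the best LOO accuracy (0.79)
```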
Parameter Optimization¶
The following code is adapted from here (originally from the scikit-learn repos) and shows how to systematically perform a parameter optimization.
To do so, we split the data into a train and test set. First, we use the training set to find the parameters which give the best accuracy.
Finding the optimal parameter for the training set¶
We evaluate a linear and a RBF kernel with different parameters.
from __future__ import print_function
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
n_samples = len(y)
# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=0)
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['linear'], 'C': [1, 10, 15, 100, 1000]},
{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
'C': [1, 10, 100, 1000]}]
score = 'accuracy'
print("# Tuning hyper-parameters for %s" % score)
print()
clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5,
scoring=score)
clf.fit(X_train, y_train)
print("Best parameters set found on development set:")
print()
print(clf.best_params_)
print()
print("Grid scores on development set:")
print()
results = clf.cv_results_
for i in range(len(results["params"])):
print("%0.3f (+/-%0.03f) for %r" % (results["mean_test_score"][i], results["std_test_score"][i] * 2, results["params"][i]))
# Tuning hyper-parameters for accuracy
Best parameters set found on development set:
{'C': 1, 'kernel': 'linear'}
Grid scores on development set:
0.790 (+/-0.194) for {'C': 1, 'kernel': 'linear'}
0.780 (+/-0.174) for {'C': 10, 'kernel': 'linear'}
0.780 (+/-0.174) for {'C': 15, 'kernel': 'linear'}
0.790 (+/-0.194) for {'C': 100, 'kernel': 'linear'}
0.790 (+/-0.194) for {'C': 1000, 'kernel': 'linear'}
0.530 (+/-0.049) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.530 (+/-0.049) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.530 (+/-0.049) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.530 (+/-0.049) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.780 (+/-0.136) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.530 (+/-0.049) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.790 (+/-0.194) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.780 (+/-0.136) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
Evaluation of the optimal parameters on untouched test-set¶
We see that an SVM with a linear kernel is most appropriate. We now evaluate these parameters on the test set, which we have not touched so far. Since the test set is untouched, this performance is a good proxy for new unseen data (provided it comes from the same distribution).
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
np.mean(y_true == y_pred)
precision recall f1-score support
-1 0.74 0.85 0.79 47
1 0.85 0.74 0.79 53
accuracy 0.79 100
macro avg 0.79 0.79 0.79 100
weighted avg 0.80 0.79 0.79 100
0.79
# Plot the decision boundaries and margins of the classifier
def PlotDecisionBoundary(model, X,y):
#plot decision boundary for model in case of 2D feature space
x1=X[:,0]
x2=X[:,1]
# Create grid to evaluate model
xx = np.linspace(min(x1), max(x1), len(x1))
yy = np.linspace(min(x2), max(x2), len(x2))
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
train_size = len(x1)
# Assigning different colors to the classes
colors = y
colors = np.where(colors == 1, '#8C7298', '#4786D1')
# Get the separating hyperplane
Z = model.decision_function(xy).reshape(XX.shape)
plt.scatter(x1, x2, c=colors)
# Draw the decision boundary and margins
plt.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
# Highlight support vectors with a circle around them
plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=100,
linewidth=1, facecolors='none', edgecolors='k')
plt.title('g=%f | C=%f' % (model.gamma, model.C))
CList=np.logspace(-2,1,4)
gammaList=np.logspace(-2,1,4)
k=0
plt.figure(figsize=(20,20))
for C in CList:
for gamma in gammaList:
k=k+1
svc = svm.SVC(kernel='rbf', C=C, gamma=gamma, probability=True).fit(X,y)
plt.subplot(len(CList), len(gammaList), k)
PlotDecisionBoundary(svc, X,y)
ML08: A3 Support Vector Machine: Pima Indians¶
MSE_FTP_MachLe, WÜRC
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
Pima-Indian dataset¶
The data set pima-indians-diabetes.csv contains medical indicators collected from 768 patients of the indigenous Pima tribe (Phoenix, AZ). The subjects were between 21 and 81 years old. The following characteristics were recorded:
- NumTimesPrg: number of pregnancies
- PlGlcConc: blood sugar 2 h after oral uptake of sugar (oGTT) (mg/dl)
- BloodP: diastolic blood pressure (mm Hg)
- SkinThick: thickness of the skin fold at the triceps (mm)
- TwoHourSerIns: insulin concentration 2 h after oGTT (μIU/ml)
- BMI: Body Mass Index (kg/m²)
- DiPedFunc: hereditary predisposition
- Age: age (years)
- HasDiabetes: diagnosis of diabetes type II
The classification goal is to make the diagnosis of type II diabetes based on the factors, i.e. to make a prediction model for the variable y= HasDiabetes using a support vector machine (SVM) with a radial basis function kernel.
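Since the model below uses a radial basis function kernel, here is a minimal sketch of what that kernel computes, checking the explicit formula k(x, x') = exp(-gamma * ||x - x'||^2) against sklearn's `rbf_kernel` on a small synthetic matrix (`Xk` and `gamma_rbf` are illustrative stand-ins):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
Xk = rng.normal(size=(5, 3))  # small synthetic sample matrix
gamma_rbf = 0.5

# explicit formula: k(x, x') = exp(-gamma * ||x - x'||^2)
sq_dists = ((Xk[:, None, :] - Xk[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma_rbf * sq_dists)

print(np.allclose(K_manual, rbf_kernel(Xk, gamma=gamma_rbf)))  # True
```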

Explorative Data Analysis (EDA)¶
Load the dataset pima-indians-diabetes.csv and create a pandas data frame from it. We take care that the column captions are imported correctly.
pima = pd.read_csv('./pima-indians-diabetes.csv', header=None)
pima.columns = ["NumTimesPrg", "PlGlcConc", "BloodP", "SkinThick", "TwoHourSerIns", "BMI", "DiPedFunc", "Age", "HasDiabetes"]
Display the first 5 entries of the dataset.
pima.head(5)
| NumTimesPrg | PlGlcConc | BloodP | SkinThick | TwoHourSerIns | BMI | DiPedFunc | Age | HasDiabetes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
It is already noticeable here that the insulin values of some patients have the value '0'.
(a) Calculate the percentage of patients with diabetes and display a statistics using df.describe¶
We can calculate the percentage of patients with diabetes by taking the mean of the binary response column 'HasDiabetes'. Using df.describe(), we can print the main statistics of the dataset.
perc_diab = pima['HasDiabetes'].mean()
print('percentage of diabetes: %f ' % np.round(perc_diab*100, 2))
y = pima['HasDiabetes']
X = pima.drop('HasDiabetes', axis=1)
X_names = X.columns
percentage of diabetes: 34.900000
pima.describe()
| NumTimesPrg | PlGlcConc | BloodP | SkinThick | TwoHourSerIns | BMI | DiPedFunc | Age | HasDiabetes | |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
Some characteristics take the value 0, although this makes no medical sense. These are:
- PlGlcConc
- BloodP
- SkinThick
- TwoHourSerIns
- BMI
For example, at least a quarter of the insulin values are missing. For these characteristics, we replace the 0 values with np.nan and then count again. For all other characteristics we do not know whether there are further missing values.
(b) Exclude samples with zero entries or missing values¶
Replace the 0 values with np.nan and print again a statistical description of the dataset using df.describe(). Then drop the np.nan values using df.dropna().
print(X_names.values)
['NumTimesPrg' 'PlGlcConc' 'BloodP' 'SkinThick' 'TwoHourSerIns' 'BMI' 'DiPedFunc' 'Age']
pima2 = pima.loc[:,X_names.values]
target= pima['HasDiabetes']
pima2.replace(0, np.nan,inplace=True)
pima2['HasDiabetes']=target
pima2.dropna(axis=0,inplace=True)
pima=pima2
pima.describe()
| NumTimesPrg | PlGlcConc | BloodP | SkinThick | TwoHourSerIns | BMI | DiPedFunc | Age | HasDiabetes | |
|---|---|---|---|---|---|---|---|---|---|
| count | 336.000000 | 336.000000 | 336.000000 | 336.000000 | 336.000000 | 336.000000 | 336.000000 | 336.000000 | 336.000000 |
| mean | 3.851190 | 122.279762 | 70.244048 | 28.663690 | 155.348214 | 32.297321 | 0.518702 | 31.836310 | 0.330357 |
| std | 3.148352 | 30.784649 | 12.363401 | 10.249863 | 118.777281 | 6.368558 | 0.327689 | 10.458446 | 0.471043 |
| min | 1.000000 | 56.000000 | 24.000000 | 7.000000 | 15.000000 | 18.200000 | 0.085000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 21.000000 | 76.000000 | 27.800000 | 0.268000 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 119.000000 | 70.000000 | 28.500000 | 125.500000 | 32.750000 | 0.446500 | 28.000000 | 0.000000 |
| 75% | 6.000000 | 144.000000 | 78.000000 | 36.000000 | 190.000000 | 36.250000 | 0.688250 | 38.000000 | 1.000000 |
| max | 17.000000 | 197.000000 | 110.000000 | 52.000000 | 846.000000 | 57.300000 | 2.329000 | 81.000000 | 1.000000 |
(c) Plot a histogram of each feature and the target using df.hist()¶
pima.hist(figsize=(12,12), bins=37)
plt.show()
corrmatrix=pima.corr()
import seaborn as sns
corrmatrix
sns.heatmap(corrmatrix)
<AxesSubplot:>
(d) Split the data in 80% training and 20% test data¶
Use train_test_split from sklearn.model_selection. If you feed a pandas.DataFrame as input to the method, you will also get pandas.DataFrames as output for the training and test features. This is quite practical.
from sklearn.model_selection import train_test_split
Xtr, Xtest, ytrain, ytest = train_test_split(pima.loc[:,X_names.values],pima.HasDiabetes,train_size=0.8)
Xtr.head()
| NumTimesPrg | PlGlcConc | BloodP | SkinThick | TwoHourSerIns | BMI | DiPedFunc | Age | |
|---|---|---|---|---|---|---|---|---|
| 539 | 3.0 | 129.0 | 92.0 | 49.0 | 155.0 | 36.4 | 0.968 | 32 |
| 503 | 7.0 | 94.0 | 64.0 | 25.0 | 79.0 | 33.3 | 0.738 | 41 |
| 498 | 7.0 | 195.0 | 70.0 | 33.0 | 145.0 | 25.1 | 0.163 | 55 |
| 555 | 7.0 | 124.0 | 70.0 | 33.0 | 215.0 | 25.5 | 0.161 | 37 |
| 420 | 1.0 | 119.0 | 88.0 | 41.0 | 170.0 | 45.3 | 0.507 | 26 |
(e) Standardize the features using the StandardScaler from sklearn.preprocessing¶
Standardize the features using the StandardScaler from sklearn.preprocessing and display the histograms of the features again.
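Note that the cell below actually fits a PowerTransformer rather than a plain StandardScaler. As a reminder, a minimal sketch of what StandardScaler itself does (zero mean, unit variance per column, shown on a tiny synthetic matrix):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])
X_scaled = StandardScaler().fit_transform(X_demo)
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```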
from sklearn.preprocessing import StandardScaler, PowerTransformer
# a PowerTransformer is used here instead of a plain StandardScaler;
# it also reduces skewness while producing zero-mean, unit-variance features
scaler = PowerTransformer()
Xtrain_scale = scaler.fit_transform(Xtr)
Xtest_scale = scaler.transform(Xtest)
Xtrain = pd.DataFrame(Xtrain_scale, columns=X.columns)
Xtrain.hist(bins=50, figsize=(20, 15))
plt.show()
(f) Train a support vector machine with a radial basis function kernel and determine the accuracy on the test data.¶
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
model = SVC(kernel='rbf', gamma=1, C=1)
# fit on the scaled array (not the DataFrame) so that predict on
# Xtest_scale does not trigger a feature-name warning
model.fit(Xtrain_scale, ytrain)
ypred = model.predict(Xtest_scale)
print('Accuracy: %f' % accuracy_score(ytest, ypred))
Accuracy: 0.691176
(g) Perform a cross validated grid search GridSearchCV to find the best parameters for $\gamma$ and $C$ and determine the best parameters and score¶
- vary the value of $C$ in the range in a logarithmic scale from $10^{-3}$ to $10^{+3}$ (7 steps)
- vary the value of $\gamma$ in the range in a logarithmic scale from $10^{-3}$ to $10^{+3}$ (7 steps)
- print a classification report and a confusion matrix of the best classifier using
classification_reportandconfusion_matrixfromsklearn.metrics.
C_range = np.logspace(-3, 3, 7)
gamma_range = np.logspace(-3, 3, 7)
param_grid = dict(gamma=gamma_range, C=C_range)
model=SVC()
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
grid = GridSearchCV(model, param_grid=param_grid, cv=cv,n_jobs=-1)
grid.fit(Xtrain_scale, ytrain)
print("The best parameters are %s with a score of %0.2f"
% (grid.best_params_, grid.best_score_))
The best parameters are {'C': 100.0, 'gamma': 0.01} with a score of 0.81
from sklearn.metrics import classification_report, confusion_matrix
ypred = grid.best_estimator_.predict(Xtest_scale)
target_names = ['negative', 'positive']
print(classification_report(ytest, ypred, target_names=target_names))
print(confusion_matrix(ytest, ypred))
precision recall f1-score support
negative 0.73 0.84 0.78 43
positive 0.63 0.48 0.55 25
accuracy 0.71 68
macro avg 0.68 0.66 0.66 68
weighted avg 0.70 0.71 0.70 68
[[36 7]
[13 12]]
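As a cross-check, the accuracy in the report can be recomputed directly from the confusion matrix above (values copied from the output):

```python
import numpy as np

cm = np.array([[36, 7],
               [13, 12]])  # confusion matrix from the output above
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 2))  # 0.71, matching the classification report
```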
(h) Plot the resulting decision boundary of the SVM classifier for different values of $\gamma$ and $C$ as contour plot in 2D¶
Use the following features as the only features for the prediction such that we can display the decision boundary in a 2D plot.
- feature 1: TwoHourSerIns (x1-axis)
- feature 2: Age (x2-axis)
Vary the C and gamma parameters in the following ranges:
C_range = [1e-2, 1, 1e2]
gamma_range = [1e-1, 1, 1e1]
You can use the helper function PlotDecisionBoundary(model, X2D, y) to plot the decision boundary and the margins of the classifier.
print(X_names.values)
Xtrain_2D=Xtrain.loc[:,['TwoHourSerIns','Age']]
Xtrain_2D
['NumTimesPrg' 'PlGlcConc' 'BloodP' 'SkinThick' 'TwoHourSerIns' 'BMI' 'DiPedFunc' 'Age']
| TwoHourSerIns | Age | |
|---|---|---|
| 0 | 0.338792 | 0.393584 |
| 1 | -0.598462 | 1.093503 |
| 2 | 0.245121 | 1.660801 |
| 3 | 0.801149 | 0.832773 |
| 4 | 0.468854 | -0.410964 |
| ... | ... | ... |
| 263 | 0.092171 | 0.756580 |
| 264 | 1.373547 | 1.034186 |
| 265 | -0.547199 | -0.788200 |
| 266 | -0.844745 | -0.590624 |
| 267 | 0.903177 | 0.285078 |
268 rows × 2 columns
def PlotDecisionBoundary(model, X2D, y):
    gamma = model.gamma
    C = model.C
    x1 = X2D.iloc[:, 0].values
    x2 = X2D.iloc[:, 1].values
    # evaluate the decision function on a grid
    xx, yy = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # visualize the decision function for these parameters
    plt.title("gamma=10^%d, C=10^%d" % (np.log10(gamma), np.log10(C)),
              size='medium')
    # visualize the parameters' effect on the decision function
    plt.pcolormesh(xx, yy, -Z, cmap=plt.cm.RdBu, shading='nearest')
    plt.contour(xx, yy, -Z, [-1, 0, 1])
    plt.scatter(x1, x2, c=y, cmap=plt.cm.RdBu_r, edgecolors='k')
    plt.xlim([-2, 2])
    plt.ylim([-2, 2])
# Now we need to fit a classifier for all parameters in the 2d version
# (we use a smaller set of parameters here because it takes a while to train)
C_range = [1e-2, 1, 1e2]
gamma_range = [1e-1, 1, 1e1]
classifiers = []
for C in C_range:
    for gamma in gamma_range:
        # note: the degree parameter is ignored when kernel='rbf'
        clf = SVC(kernel='rbf', C=C, gamma=gamma)
        clf.fit(Xtrain_2D.values, ytrain)
        classifiers.append((C, gamma, clf))
# #############################################################################
# Visualization
#
# draw visualization of parameter effects
for (k, (C, gamma, clf)) in enumerate(classifiers):
    plt.figure(figsize=(8, 8))
    # visualize decision function for these parameters
    # plt.subplot(len(C_range), len(gamma_range), k + 1)
    PlotDecisionBoundary(clf, Xtrain_2D, ytrain)
    # plt.savefig('diabetes.pdf')
ML08: A4 SVM with polynomial and linear kernel¶
MSE, FTP_MachLe, WÜRC
Whenever you have a model that is represented with inner products, you can plug in a kernel function. For instance, a linear kernel leaves the feature space unchanged, so the resulting model is the same as a support vector classifier and the decision boundary is linear.
- With polynomial kernels, you project the original feature space into a polynomial feature space, so the decision boundary that separates the classes is defined by a higher-order polynomial.
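The feature-space view can be made concrete with a few lines of NumPy. The sketch below (with our own helper names phi and poly_kernel) checks that the degree-2 polynomial kernel $(1 + x\cdot z)^2$, the form computed by SVC(kernel='poly', degree=2) with gamma=1 and coef0=1, equals an ordinary inner product after an explicit quadratic feature map:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for a 2D input."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel (1 + x.z)^2, evaluated without building phi."""
    return (1.0 + x @ z) ** 2

x = np.array([0.5, -1.2])
z = np.array([2.0, 0.3])
# Both sides compute the same number; the kernel never forms the 6D features
assert np.isclose(poly_kernel(x, z), phi(x) @ phi(z))
```

The kernel trick is exactly this shortcut: the classifier only ever needs the inner products, so the higher-dimensional feature vectors are never materialized.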
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def generate_random_dataset(size):
    """Generate a random dataset that follows a quadratic distribution."""
    x = []
    y = []
    target = []
    for i in range(size):
        # class zero
        x.append(np.round(random.uniform(0, 2.5), 1))
        y.append(np.round(random.uniform(0, 20), 1))
        target.append(0)
        # class one
        x.append(np.round(random.uniform(1, 5), 2))
        y.append(np.round(random.uniform(20, 25), 2))
        target.append(1)
        x.append(np.round(random.uniform(3, 5), 2))
        y.append(np.round(random.uniform(5, 25), 2))
        target.append(1)
    df_x = pd.DataFrame(data=x)
    df_y = pd.DataFrame(data=y)
    df_target = pd.DataFrame(data=target)
    data_frame = pd.concat([df_x, df_y], ignore_index=True, axis=1)
    data_frame = pd.concat([data_frame, df_target], ignore_index=True, axis=1)
    data_frame.columns = ['x', 'y', 'target']
    return data_frame
# Generate dataset
size = 100
dataset = generate_random_dataset(size)
X = dataset[['x', 'y']]
y = dataset['target']
dataset.to_csv('dataset.csv')
dataset.head()
| x | y | target | |
|---|---|---|---|
| 0 | 0.60 | 7.50 | 0 |
| 1 | 4.32 | 24.32 | 1 |
| 2 | 3.33 | 8.55 | 1 |
| 3 | 1.90 | 8.90 | 0 |
| 4 | 3.22 | 23.99 | 1 |
(a) Split the data in 70% training and 30% test data¶
using train_test_split from sklearn.model_selection.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
(b) Plot the training data as scatter plot using different colors for both classes.¶
# Plotting the training set
fig, ax = plt.subplots(figsize=(12, 7))
# adding major gridlines
ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)
ax.scatter(x_train.x, x_train.y, c=y_train)
plt.show()
(c) Fit an SVM with a second-degree polynomial kernel using $\gamma=0.1$ and $C=1$¶
There’s a little space between the two groups of data points. But closer to the center, it’s not clear which data point belongs to which class. A quadratic curve might be a good candidate to separate these classes. So let’s fit an SVM with a second-degree polynomial kernel.
from sklearn import svm
model = svm.SVC(kernel='poly', degree=2,C=1,gamma=0.10)
model.fit(x_train, y_train)
SVC(C=1, degree=2, gamma=0.1, kernel='poly')
(d) Plot the margins and decision boundary of the classifier¶
- use the function PlotDecisionBoundary(model, X, y)
- The input arguments are the instance of the trained model, the 2D array of the features X and the 1D array of the target y.
# Plot the dataset
# Plot the dataset
def PlotDecisionBoundary(model, X, y):
    # plot the decision boundary for a model with a 2D feature space
    x1 = X[:, 0]
    x2 = X[:, 1]
    # Create grid to evaluate model
    xx = np.linspace(min(x1), max(x1), len(x1))
    yy = np.linspace(min(x2), max(x2), len(x2))
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    # Assigning different colors to the classes
    colors = np.where(y == 1, '#8C7298', '#4786D1')
    # Get the separating hyperplane
    Z = model.decision_function(xy).reshape(XX.shape)
    fig, ax = plt.subplots(figsize=(12, 7))
    ax.scatter(x1, x2, c=colors)
    # Draw the decision boundary and margins
    ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
    # Highlight support vectors with a circle around them
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=100,
               linewidth=1, facecolors='none', edgecolors='k')
    plt.title('g=%f | C=%f' % (model.gamma, model.C))
    # plt.savefig('PolynomialKernel.pdf')
    plt.show()
PlotDecisionBoundary(model, x_train.values,y_train)
(e) Vary the hyperparameter $\gamma$ logarithmically in the range from $10^{-2}$ to $10^{+2}$ in 5 steps.¶
- set $C=1$ and plot the decision boundary and the margins for each value of the parameter $\gamma$.
gammaList=np.logspace(-2,2,5)
for gamma in gammaList:
    model = svm.SVC(kernel='poly', degree=2, C=1, gamma=gamma)
    model.fit(x_train, y_train)
    PlotDecisionBoundary(model, x_train.values, y_train)
(f) Vary the hyperparameter $C$ logarithmically in the range from $10^{0}$ to $10^{4}$ in 5 steps.¶
- set $\gamma=0.01$ and plot the decision boundary and the margins for each value of the parameter $C$.
gamma = 0.01
CList = np.logspace(0, 4, 5)
for C in CList:
    model = svm.SVC(kernel='rbf', C=C, gamma=gamma)
    model.fit(x_train, y_train)
    PlotDecisionBoundary(model, x_train.values, y_train)
Lab 9, A5: Illustration of prior and posterior Gaussian process for different kernels¶
Multivariate Gaussian distributions are useful for modeling finite collections of real-valued variables because of their nice analytical properties. Gaussian processes $\mathcal{GP}$ are the extension of multivariate Gaussians to infinite-sized collections of real-valued variables. In particular, this extension will allow us to think of Gaussian processes as distributions not just over random vectors but in fact distributions over random functions.
Unlike classical learning algorithms, Bayesian algorithms do not attempt to identify “best-fit” models of the data (or similarly, make “best guess” predictions for new test inputs). Instead, they compute a posterior distribution over models (or similarly, compute posterior predictive distributions for new test inputs). These distributions provide a useful way to quantify our uncertainty in model estimates, and to exploit our knowledge of this uncertainty in order to make more robust predictions on new test points.
Gaussian Process¶
A stochastic process is a collection of random variables, $\left\lbrace f(x) : x \in \mathcal{X} \right\rbrace$, indexed by elements from some set $\mathcal{X}$, known as the index set.
A Gaussian process is a stochastic process such that any finite subcollection of random variables has a multivariate Gaussian distribution. In particular, a collection of random variables $\left\lbrace f(x) : x \in \mathcal{X} \right\rbrace$ is said to be drawn from a Gaussian process $\mathcal{GP}$ with mean function $m(·)$ and covariance function $k(·, ·)$ if for any finite set of elements $x_1, \dots, x_m \in \mathcal{X}$, the associated finite set of random variables $f(x_1), \dots, f(x_m)$ has a joint multivariate Gaussian distribution.
When we form a Gaussian process $\mathcal{GP}$ we assume data is jointly Gaussian with a particular mean and covariance,
$$ p(\mathbf{f}|\mathbf{X}) \sim \mathcal{N}(\mathbf{m}(\mathbf{X}), \mathbf{K}(\mathbf{X})), $$
$$ \mathbf{f} \sim \mathcal{GP}(\mathbf{m}(\mathbf{x}), \mathbf{k}(\mathbf{x,x'})), $$
where the conditioning is on the inputs $\mathbf{X}$ which are used for computing the mean and covariance. For this reason they are known as mean and covariance functions. To make things clearer, let us assume that the mean function $\mathbf{m}(\mathbf{X})$ is zero, i.e. the jointly Gaussian distribution is centered around zero.
In this case, the Gaussian process perspective takes the marginal likelihood of the data to be a joint Gaussian density with a covariance given by $\mathbf{K}$. So the model likelihood is of the form, $$ p(\mathbf{y}|\mathbf{X}) = \frac{1}{(2\pi)^{\frac{n}{2}}|\mathbf{K}+\sigma^2\mathbf{I}|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}\mathbf{y}^\top \left(\mathbf{K}+\sigma^2 \mathbf{I}\right)^{-1}\mathbf{y}\right) $$ where the input data, $\mathbf{X}$, influences the density through the covariance matrix, $\mathbf{K}$, whose elements are computed through the covariance function, $k(\mathbf{x}, \mathbf{x}^\prime)$.
This means that the negative log likelihood (the objective function) is given by, $$ E(\boldsymbol{\theta}) = \frac{1}{2} \log |\mathbf{K}+\sigma^2\mathbf{I}| + \frac{1}{2} \mathbf{y}^\top \left(\mathbf{K} + \sigma^2\mathbf{I}\right)^{-1}\mathbf{y} $$ where the parameters of the model are embedded in the covariance function; they include the parameters of the kernel (such as lengthscale and variance) and the noise variance, $\sigma^2$. These parameters are called hyperparameters. Often, they are specified as the logarithm of the hyperparameters and are then called log-hyperparameters.
Making Predictions - computing the posterior¶
We therefore have a probability density that represents functions. How do we make predictions with this density? The density is known as a process because it is consistent. By consistency, here, we mean that the model makes predictions for $\mathbf{f}$ that are unaffected by future values of $\mathbf{f}^*$ that are currently unobserved (such as test points). If we think of $\mathbf{f}^*$ as test points, we can still write down a joint probability density over the training observations, $\mathbf{f}$ and the test observations, $\mathbf{f}^*$. This joint probability density will be Gaussian, with a covariance matrix given by our covariance function, $k(\mathbf{x}_i, \mathbf{x}_j)$.
$$ \begin{bmatrix}\mathbf{f} \\ \mathbf{f}^*\end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K} & \mathbf{K}_\ast \\ \mathbf{K}_\ast^\top & \mathbf{K}_{\ast,\ast}\end{bmatrix}\right) $$
where $\mathbf{K}$ is the covariance computed between all the training points, $\mathbf{K}_\ast$ is the covariance matrix computed between the training points and the test points, and $\mathbf{K}_{\ast,\ast}$ is the covariance matrix computed between all the test points and themselves.
Conditional Density¶
Just as in naive Bayes, we first define the joint density, $p(\mathbf{y}, \mathbf{X})$ and now we need to define conditional distributions that answer particular questions of interest. In particular we might be interested in finding out the values of the function for the prediction function at the test data given those at the training data, $p(\mathbf{f}_*|\mathbf{f})$. Or if we include noise in the training observations then we are interested in the conditional density for the prediction function at the test locations given the training observations, $p(\mathbf{f}^*|\mathbf{y})$.
As ever, all the various questions we could ask about this density can be answered using the sum rule and the product rule. For the multivariate normal density, the mathematics involved is that of linear algebra, with a particular emphasis on the partitioned (block) matrix inverse, but the details are beyond the scope of this course, so you don't need to worry about remembering or rederiving them. We simply state the results here because it is this conditional density that is necessary for making predictions.
The conditional density is also a multivariate normal, $$ p(\mathbf{f}^* | \mathbf{y}) \sim \mathcal{N}(\boldsymbol{\mu}_f,\mathbf{C}_f) $$ with a mean given by $$ \boldsymbol{\mu}_f = \mathbf{K}_*^\top \left[\mathbf{K} + \sigma^2 \mathbf{I}\right]^{-1} \mathbf{y} $$ and a covariance given by $$ \mathbf{C}_f = \mathbf{K}_{*,*} - \mathbf{K}_*^\top \left[\mathbf{K} + \sigma^2 \mathbf{I}\right]^{-1} \mathbf{K}_\ast. $$
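These two formulas translate directly into linear algebra. A minimal sketch (our own helper names, with an assumed RBF covariance) computing the posterior mean and covariance at a set of test points:

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """Squared-exponential covariance between two sets of 1D inputs."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X, y, Xstar, lengthscale=1.0, noise_var=0.1):
    """Posterior mean and covariance of f* given noisy observations y."""
    K = rbf(X, X, lengthscale) + noise_var * np.eye(len(X))   # K + sigma^2 I
    Ks = rbf(X, Xstar, lengthscale)                           # K_*   (train x test)
    Kss = rbf(Xstar, Xstar, lengthscale)                      # K_*,* (test x test)
    solve = np.linalg.solve(K, Ks)                            # [K + sigma^2 I]^{-1} K_*
    mu = solve.T @ y                                          # posterior mean
    cov = Kss - Ks.T @ solve                                  # posterior covariance
    return mu, cov

X = np.array([0.0, 1.0, 2.5, 4.0])
y = np.sin(X)
Xstar = np.linspace(0, 4, 50)
mu, cov = gp_posterior(X, y, Xstar)
```

The diagonal of cov gives pointwise predictive variances, which is what produces the shaded uncertainty bands in GP regression plots.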
But now let's look at some frequently used covariance functions:
(a) Analyze the code¶
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, Matern, RationalQuadratic,
ExpSineSquared, DotProduct,
ConstantKernel)
We first instantiate a list of kernels that we want to analyze. The following kernels (covariance functions) will be analyzed:
kernels = [1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-1, 10.0)),
1.0 * RationalQuadratic(length_scale=1.0, alpha=0.1),
1.0 * ExpSineSquared(length_scale=1.0, periodicity=3.0,
length_scale_bounds=(0.1, 10.0),
periodicity_bounds=(1.0, 10.0)),
ConstantKernel(0.1, (0.01, 10.0))
* (DotProduct(sigma_0=1.0, sigma_0_bounds=(0.0, 10.0)) ** 2),
1.0 * Matern(length_scale=1.0, length_scale_bounds=(1e-1, 10.0),
nu=1.5)]
for fig_index, kernel in enumerate(kernels):
    # Specify Gaussian Process
    gp = GaussianProcessRegressor(kernel=kernel)
    # Plot prior
    plt.figure(fig_index, figsize=(8, 8))
    plt.subplot(2, 1, 1)
    X_ = np.linspace(0, 5, 100)
    y_mean, y_std = gp.predict(X_[:, np.newaxis], return_std=True)
    plt.plot(X_, y_mean, 'k', lw=3, zorder=9)
    plt.fill_between(X_, y_mean - y_std, y_mean + y_std,
                     alpha=0.2, color='k')
    y_samples = gp.sample_y(X_[:, np.newaxis], 10)
    plt.plot(X_, y_samples, lw=1)
    plt.xlim(0, 5)
    plt.ylim(-3, 3)
    plt.title("Prior (kernel: %s)" % kernel, fontsize=12)
    # Generate data and fit GP
    rng = np.random.RandomState(4)
    X = rng.uniform(0, 5, 10)[:, np.newaxis]
    y = np.sin((X[:, 0] - 2.5) ** 2)
    gp.fit(X, y)
    # Plot posterior
    plt.subplot(2, 1, 2)
    X_ = np.linspace(0, 5, 100)
    y_mean, y_std = gp.predict(X_[:, np.newaxis], return_std=True)
    plt.plot(X_, y_mean, 'k', lw=3, zorder=9)
    plt.fill_between(X_, y_mean - y_std, y_mean + y_std,
                     alpha=0.2, color='k')
    y_samples = gp.sample_y(X_[:, np.newaxis], 10)
    plt.plot(X_, y_samples, lw=1)
    plt.scatter(X[:, 0], y, c='r', s=50, zorder=10, edgecolors=(0, 0, 0))
    plt.xlim(0, 5)
    plt.ylim(-3, 3)
    plt.title("Posterior (kernel: %s)\n Log-Likelihood: %.3f"
              % (gp.kernel_, gp.log_marginal_likelihood(gp.kernel_.theta)),
              fontsize=12)
    plt.tight_layout()
plt.show()
(b) Kernel functions¶
Have a look at the samples drawn from the given Gaussian processes for each kernel: RBF, Matern, RationalQuadratic, ExpSineSquared, DotProduct and ConstantKernel. Give an example for data that could be described by these covariance functions.
(i) The RBF kernel (also called Squared Exponential (SE) kernel)¶
The radial basis function kernel or short RBF kernel is a stationary kernel. Stationary means, the kernel $K(x,x')=K(x-x')$ is invariant to translations. It is also known as the “squared exponential” kernel. It is parameterized by a length-scale parameter $\lambda$, which can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs (anisotropic variant of the kernel). The kernel is given by:
$$ k(x,x')= \sigma_0^2 \exp \left[ -\frac{1}{2} \left( \frac{x-x'}{\lambda} \right)^2 \right]$$This kernel is infinitely differentiable, which implies that $\mathcal{GP}$s with this kernel as covariance function have mean square derivatives of all orders, and are thus very smooth.
Example: We can use it to model and predict very smooth processes or signals.
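As a quick sanity check, the formula (with $\sigma_0^2 = 1$) can be evaluated directly and compared with scikit-learn's RBF kernel; the helper rbf_manual below is our own:

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF

def rbf_manual(X1, X2, lengthscale):
    """k(x, x') = exp(-(x - x')^2 / (2 * lengthscale^2)) for 1D inputs."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

X = np.linspace(0, 5, 20)
K_manual = rbf_manual(X, X, lengthscale=1.5)
K_sklearn = RBF(length_scale=1.5)(X[:, None])  # calling a kernel returns the Gram matrix
assert np.allclose(K_manual, K_sklearn)
```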
(ii) the Rational-Quadratic kernel¶
The RationalQuadratic kernel can be seen as a scale mixture (an infinite sum) of RBF kernels with different characteristic length-scales. It is parameterized by a length-scale parameter $\lambda$ and a scale mixture parameter $\alpha$. Only the isotropic variant where $\lambda$ is a scalar is supported at the moment. The kernel is given by:
$$ k(x,x')=\left( 1 + \frac{( x-x')^2 }{2\alpha \lambda^2} \right)^{-\alpha}$$
(iii) the Exp-Sine-Squared kernel¶
The Exp-Sine-Squared kernel allows modeling periodic functions. It is parameterized by a length-scale parameter $\lambda$ and a periodicity parameter $p$. Only the isotropic variant where $\lambda$ is a scalar is supported at the moment. The kernel is given by:
$$k(x, x') = \exp \left[-\frac{ 2\sin^2 \left( \pi \vert x-x' \vert / p \right)}{ \lambda^2} \right]$$
Example: modeling periodic changes, e.g. daily temperature changes or seasonally varying quantities such as the $\mathrm{CO}_2$ concentration.
(iv) the Matérn kernel¶
The Matern kernel is a stationary kernel and a generalization of the RBF kernel. It has an additional parameter $\nu$ which controls the smoothness of the resulting function. It is parameterized by a length-scale parameter $\lambda$ , which can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs (anisotropic variant of the kernel). The kernel is given by:
$$k(x, x') = \sigma^2\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\, \frac{\vert x-x'\vert}{\lambda}\right)^{\nu} K_\nu \left(\sqrt{2\nu}\, \frac{\vert x-x'\vert}{\lambda}\right)$$where $K_{\nu}$ is a modified Bessel function of the second kind.
- It is stationary and isotropic. In the limit $\nu \rightarrow \infty$, the Matérn kernel converges to the RBF covariance function.
- For finite $\nu$, the Matérn kernel generates much rougher sample functions.
- For the special case $\nu = \frac{1}{2}$, the kernel becomes $k(x, x') = \exp \left( - \frac{ \vert x - x' \vert}{\lambda} \right)$. This corresponds to an Ornstein-Uhlenbeck process, yielding very rough sample functions (https://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process).
Example: modeling stock prices, random walks, non-differentiable stochastic processes
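For instance, a small check (with our own ou_kernel helper) that scikit-learn's Matern kernel at $\nu = 1/2$ reduces to the exponential (Ornstein-Uhlenbeck) covariance:

```python
import numpy as np
from sklearn.gaussian_process.kernels import Matern

def ou_kernel(X1, X2, lengthscale):
    """Ornstein-Uhlenbeck / exponential kernel: the Matern kernel at nu = 1/2."""
    d = np.abs(X1[:, None] - X2[None, :])
    return np.exp(-d / lengthscale)

X = np.linspace(0, 5, 15)
K_ou = ou_kernel(X, X, lengthscale=1.0)
K_matern = Matern(length_scale=1.0, nu=0.5)(X[:, None])
assert np.allclose(K_ou, K_matern)
```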
(v) the Dot-Product kernel¶
The Dot-Product kernel is non-stationary and can be obtained from linear regression by putting $\mathcal{N}(0,1)$ priors on the coefficients of $x_d\ (d=1\dots D)$ and a $\mathcal{N}(0,\sigma_0^2)$ prior on the bias. The Dot-Product kernel is invariant to a rotation of the coordinates about the origin, but not to translations. It is parameterized by a parameter $\sigma_0^2$. For $\sigma_0^2=0$, the kernel is called the homogeneous linear kernel, otherwise it is inhomogeneous. The kernel is given by
$$ k(x,x')=\sigma_0^2 +x\cdot x'$$
The Dot-Product kernel is commonly combined with exponentiation.
Example: polynomial regression
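A small check that squaring the inhomogeneous dot-product kernel (as done in the kernel list of part (a)) yields the degree-2 polynomial kernel $(\sigma_0^2 + x\cdot x')^2$:

```python
import numpy as np
from sklearn.gaussian_process.kernels import DotProduct

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))

# Kernel exponentiation (** 2) squares the Gram matrix entrywise
K = (DotProduct(sigma_0=1.0) ** 2)(X)
assert np.allclose(K, (1.0 + X @ X.T) ** 2)
```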
(c) How to model a periodic signal with noise using a $\mathcal{GP}$¶
Kernels can be combined by addition and multiplication. To model a purely periodic process with white noise, we can construct a new kernel consisting of the Exp-Sine-Squared kernel plus the white noise kernel. In order to allow the model to adapt to variations of the amplitude $A$ of the periodic signal, we multiply the periodic kernel by an RBF kernel.
import numpy as np
import matplotlib.pyplot as plt
A = 1           # amplitude
p = 2           # periodicity
sigma0 = 0.001  # noise level
N = 200         # number of points
noise = np.random.normal(0, 0.1, N)
X = np.linspace(0, 10, N)
y = A * np.sin(2 * np.pi * (X / p)) + noise
X = X[:, np.newaxis]
y = y[:, np.newaxis]
# plot the noisy sine wave
plt.figure()
plt.plot(X,y)
plt.grid(True)
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels \
import RBF, WhiteKernel, RationalQuadratic, ExpSineSquared
# variation of the amplitude of the sine
k1 = A**2 * RBF(length_scale=30)
# periodic component with variable amplitude
k2 = ExpSineSquared(length_scale=20, periodicity=p)
# noise terms
k3 = WhiteKernel(noise_level=sigma0**2)
kernel_gpml = k1*k2 + k3
gp = GaussianProcessRegressor(kernel=kernel_gpml,
alpha=0.0,
optimizer=None,
normalize_y=True)
gp.fit(X, y)
GaussianProcessRegressor(alpha=0.0,
                         kernel=1**2 * RBF(length_scale=30) * ExpSineSquared(length_scale=20, periodicity=2) + WhiteKernel(noise_level=1e-06),
                         normalize_y=True, optimizer=None)
X_ = np.linspace(X.min(), X.max() + 10, 100)
X_ = X_[:,np.newaxis]
y_pred, y_std = gp.predict(X_, return_std=True)
y_pred = np.ravel(y_pred)  # flatten in case y was fitted as a column vector
# Illustration
plt.scatter(X, y, c='k')
plt.plot(X_, y_pred)
plt.fill_between(X_[:,0], y_pred - 3*y_std,
y_pred + 3*y_std,
alpha=0.8, color='k')
plt.xlim(X_.min(), X_.max())
plt.xlabel("x")
plt.ylabel('noisy periodic signal')
plt.title('fitting a noisy periodic signal using $\mathcal{GP}$')
plt.tight_layout()
plt.grid(True)
plt.show()
np.shape(X_[:,0])
(100,)
(d) Matérn kernel¶
The Matern kernel is a stationary kernel and a generalization of the RBF kernel. It has an additional parameter $\nu$ which controls the smoothness of the resulting function. It is parameterized by a length-scale parameter $\lambda$ , which can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs (anisotropic variant of the kernel). The kernel is given by:
$$k(x, x') = \sigma^2\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\, \frac{\vert x-x'\vert}{\lambda}\right)^{\nu} K_\nu \left(\sqrt{2\nu}\, \frac{\vert x-x'\vert}{\lambda}\right)$$where $K_{\nu}$ is a modified Bessel function of the second kind.
- It is stationary and isotropic. In the limit $\nu \rightarrow \infty$, the Matérn kernel converges to the RBF covariance function.
- For finite $\nu$, the Matérn kernel generates much rougher sample functions.
- For the special case $\nu = \frac{1}{2}$, the kernel becomes $k(x, x') = \exp \left( - \frac{ \vert x - x' \vert}{\lambda} \right)$. This corresponds to an Ornstein-Uhlenbeck process, yielding very rough sample functions (https://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process).
Example: modeling stock prices, random walks, non-differentiable stochastic processes
Summary¶
We close our inspection of our Gaussian processes by pointing out some reasons why Gaussian processes are an attractive model for use in regression problems and in some cases may be preferable to alternative models (such as linear and locally-weighted linear regression):
- As Bayesian methods, Gaussian process models allow one to quantify uncertainty in predictions resulting not just from intrinsic noise in the problem but also the errors in the parameter estimation procedure. Furthermore, many methods for model selection and hyperparameter selection in Bayesian methods are immediately applicable to Gaussian processes (though we did not address any of these advanced topics here).
- Like locally-weighted linear regression, Gaussian process regression is non-parametric and hence can model essentially arbitrary functions of the input points.
- Gaussian process regression models provide a natural way to introduce kernels into a regression modeling framework. By careful choice of kernels, Gaussian process regression models can sometimes take advantage of structure in the data.
- Gaussian process regression models, though perhaps somewhat tricky to understand conceptually, nonetheless lead to simple and straightforward linear algebra implementations.
References¶
[1] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. Online: http://www.gaussianprocess.org/gpml/
[2] Chuong B. Do: Gaussian Processes, University of Stanford (2007)
[3] Neil D. Lawrence, Nicolas Durande: GPy introduction covariance functions, Machine Learning Summer School, Sydney, Australia (2015)
%matplotlib inline
Lab 9, A6: Gaussian process regression (GPR) with noise-level estimation¶
This example illustrates that GPR with a sum-kernel including a WhiteKernel can estimate the noise level of data.
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process import kernels
# Generate noisy sine wave
nSamples=300;
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, nSamples)[:, np.newaxis]
y = 0.5 * np.sin(3 * X[:, 0]) + rng.normal(0, 0.3, X.shape[0])
(a) Inspect and interpret the data using a plot¶
Have a look at the data and make a good guess for the kernel to be selected.
plt.figure(figsize=(10, 10))
plt.scatter(X, y)
plt.grid(True)
plt.xlabel('X')
plt.ylabel('y')
plt.title('raw noisy data')
Text(0.5, 1.0, 'raw noisy data')
(b) Create a suitable kernel for the covariance function¶
The data shows a noisy sine-like oscillation. So it makes sense to select for the oscillating part of the covariance either
- the sin-exponential kernel (ExpSineSquared)
- the RBF kernel
For the noisy part, we choose a white noise kernel (WhiteKernel). We start with the RBF kernel and additive white noise.
# First run: using the RBF kernel and white noise
kernel = 1.0 * kernels.RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
+ kernels.WhiteKernel(noise_level=1e-5, noise_level_bounds=(1e-10, 1e+1))
gp1 = GaussianProcessRegressor(kernel=kernel, alpha=1e-5).fit(X, y)
X_ = np.linspace(0, 10, 100)
y_mean, y_cov = gp1.predict(X_[:, np.newaxis], return_cov=True)
plt.figure(figsize=(10,10))
plt.plot(X_, y_mean, 'k', lw=3, zorder=9)
plt.fill_between(X_, y_mean - np.sqrt(np.diag(y_cov)),
y_mean + np.sqrt(np.diag(y_cov)),
alpha=0.5, color='k')
plt.plot(X_, 0.5*np.sin(3*X_), 'r', lw=3, zorder=9)
plt.scatter(X[:, 0], y, c='r', s=50, zorder=10, edgecolors=(0, 0, 0))
plt.title("Initial: %s\nOptimum: %s\nLog-Marginal-Likelihood: %s"
% (kernel, gp1.kernel_,
gp1.log_marginal_likelihood(gp1.kernel_.theta)))
plt.tight_layout()
y_cov.shape
plt.matshow(y_cov)
plt.show()
#Get out the hyperparameters
gp1.kernel_
0.426**2 * RBF(length_scale=0.53) + WhiteKernel(noise_level=0.0896)
(c) using the ExpSineSquared kernel¶
# Second run: using the sin-exponential kernel and white noise
kernel = 0.5 * kernels.ExpSineSquared(length_scale=4.0, periodicity=1,
length_scale_bounds=(1e-1, 1e3),
periodicity_bounds=(1e-1, 4)) \
+ kernels.WhiteKernel(noise_level=1e-6, noise_level_bounds=(1e-10, 1e+1))
gp = GaussianProcessRegressor(kernel=kernel, alpha=1E-5).fit(X, y)
X_ = np.linspace(0, 10, 100)
y_mean, y_cov = gp.predict(X_[:, np.newaxis], return_cov=True)
plt.figure(figsize=(10,10))
plt.plot(X_, y_mean, 'k', lw=3, zorder=9)
plt.fill_between(X_, y_mean - np.sqrt(np.diag(y_cov)),
y_mean + np.sqrt(np.diag(y_cov)),
alpha=0.5, color='k')
plt.plot(X_, 0.5*np.sin(3*X_), 'r', lw=3, zorder=9)
plt.scatter(X[:, 0], y, c='r', s=50, zorder=10, edgecolors=(0, 0, 0))
plt.title("Initial: %s\nOptimum: %s\nLog-Marginal-Likelihood: %s"
% (kernel, gp.kernel_,
gp.log_marginal_likelihood(gp.kernel_.theta)))
plt.tight_layout()
plt.matshow(y_cov)
plt.show()
#Get out the hyperparameters
gp.kernel_
0.44**2 * ExpSineSquared(length_scale=0.769, periodicity=4) + WhiteKernel(noise_level=0.0958)
References¶
[1] Jan Hendrik Metzen jhm@informatik.uni-bremen.de
Lab 10, A0 Principal Component Analysis¶
Up until now, we have been looking in depth at supervised learning estimators: those estimators that predict labels based on labeled training data. Here we begin looking at several unsupervised estimators, which can highlight interesting aspects of the data without reference to any known labels.
In this section, we explore what is perhaps one of the most broadly used of unsupervised algorithms, principal component analysis (PCA). PCA is fundamentally a dimensionality reduction algorithm, but it can also be useful as a tool for visualization, for noise filtering, for feature extraction and engineering, and much more. After a brief conceptual discussion of the PCA algorithm, we will see a couple examples of these further applications.
We begin with the standard imports:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
Introducing Principal Component Analysis¶
Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction in data, which we saw briefly in Introducing Scikit-Learn. Its behavior is easiest to visualize by looking at a two-dimensional dataset. Consider the following 200 points:
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');
n = np.shape(X)[0]
# sample covariance matrix (X is approximately zero-mean here, so we skip centering)
C = np.dot(X.T, X) / (n - 1)
C
array([[0.68330628, 0.23079731],
[0.23079731, 0.09884853]])
# eigenvalues and eigenvectors of the covariance matrix
eig_val_sc, eig_vec_sc = np.linalg.eig(C)
eig_val_sc
array([0.76345505, 0.01869975])
eig_vec_sc
array([[ 0.94465994, -0.3280512 ],
[ 0.3280512 , 0.94465994]])
By eye, it is clear that there is a nearly linear relationship between the x and y variables. This is reminiscent of the linear regression data, but the problem setting here is slightly different: rather than attempting to predict the y values from the x values, the unsupervised learning problem attempts to learn about the relationship between the x and y values.
In principal component analysis, this relationship is quantified by finding a list of the principal axes in the data, and using those axes to describe the dataset.
Using Scikit-Learn's PCA estimator, we can compute this as follows:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.9)
pca.fit(X)
PCA(n_components=0.9)
pca.n_components_
1
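Passing a float between 0 and 1 as n_components tells PCA to keep the smallest number of components whose cumulative explained variance ratio reaches that fraction; here a single component already suffices. A quick sketch of the underlying computation:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

# Cumulative explained variance of the full decomposition
cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
print(cum)  # the first entry is already above 0.9
print(PCA(n_components=0.9).fit(X).n_components_)
```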
The fit learns some quantities from the data, most importantly the "components" and "explained variance":
print(pca.components_)
[[-0.94446029 -0.32862557]]
print(pca.explained_variance_ratio_)
[0.97634101]
To see what these numbers mean, let's visualize them as vectors over the input data, using the "components" to define the direction of the vector, and the "explained variance" to define the squared-length of the vector:
def draw_vector(v0, v1, ax=None):
ax = ax or plt.gca()
arrowprops=dict(arrowstyle='->',
linewidth=3,
color='k',
shrinkA=0, shrinkB=0)
ax.annotate('', v1, v0, arrowprops=arrowprops)
# plot data
plt.figure(figsize=(8,8))
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
v = vector * 3 * np.sqrt(length)
draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');
These vectors represent the principal axes of the data, and the length of each vector is an indication of how "important" that axis is in describing the distribution of the data—more precisely, it is a measure of the variance of the data when projected onto that axis. The projections of each data point onto the principal axes are the "principal components" of the data.
If we plot these principal components beside the original data, we see the plots shown here:
This transformation from data axes to principal axes is an affine transformation, which basically means it is composed of a translation, rotation, and uniform scaling.
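With the default settings (no whitening), this affine map can be written out explicitly: transform subtracts the mean and then rotates the data onto the principal axes. A minimal sanity check:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
pca = PCA(n_components=2).fit(X)

# transform(X) is equivalent to centering followed by a rotation
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X)))
```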
While this algorithm to find principal components may seem like just a mathematical curiosity, it turns out to have very far-reaching applications in the world of machine learning and data exploration.
PCA as dimensionality reduction¶
Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.
Here is an example of using PCA as a dimensionality reduction transform:
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape: ", X.shape)
print("transformed shape:", X_pca.shape)
original shape:  (200, 2)
transformed shape: (200, 1)
The transformed data has been reduced to a single dimension. To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)
plt.axis('equal');
The light points are the original data, while the dark points are the projected version. This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance. The fraction of variance that is cut out (proportional to the spread of points about the line formed in this figure) is roughly a measure of how much "information" is discarded in this reduction of dimensionality.
This reduced-dimension dataset is in some senses "good enough" to encode the most important relationships between the points: despite reducing the dimension of the data by 50%, the overall relationships between the data points are mostly preserved.
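The discarded "information" can be quantified directly: the mean squared reconstruction error of the inverse-transformed data corresponds to the variance along the dropped axis. A short sketch:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

pca = PCA(n_components=1).fit(X)
X_new = pca.inverse_transform(pca.transform(X))

# Mean squared reconstruction error measures the discarded variance
mse = np.mean(np.sum((X - X_new) ** 2, axis=1))
kept = pca.explained_variance_ratio_.sum()  # fraction of variance retained
print(mse, kept)
```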
Principal Component Analysis Summary¶
In this section we have discussed the use of principal component analysis for dimensionality reduction, for visualization of high-dimensional data, for noise filtering, and for feature selection within high-dimensional data. Because of the versatility and interpretability of PCA, it has been shown to be effective in a wide variety of contexts and disciplines. Given any high-dimensional dataset, I tend to start with PCA in order to visualize the relationship between points (as we did with the digits), to understand the main variance in the data (as we did with the eigenfaces), and to understand the intrinsic dimensionality (by plotting the explained variance ratio). Certainly PCA is not useful for every high-dimensional dataset, but it offers a straightforward and efficient path to gaining insight into high-dimensional data.
PCA's main weakness is that it tends to be highly affected by outliers in the data.
For this reason, many robust variants of PCA have been developed, many of which act to iteratively discard data points that are poorly described by the initial components.
Scikit-Learn contains a couple of interesting variants on PCA, including randomized PCA and SparsePCA, both in the sklearn.decomposition submodule.
Randomized PCA, which we saw earlier (in recent scikit-learn versions it is accessed via PCA(svd_solver='randomized') rather than the removed RandomizedPCA class), uses a non-deterministic method to quickly approximate the first few principal components in very high-dimensional data, while SparsePCA introduces a regularization term that enforces sparsity of the components.
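A sketch of both variants on synthetic data (the 500×100 random matrix below is just a stand-in, not a dataset from this notebook; in current scikit-learn versions the randomized solver is selected via the svd_solver argument of PCA):

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.RandomState(42)
X = rng.randn(500, 100)  # synthetic high-dimensional data

# Randomized solver: fast approximation of the leading components
rpca = PCA(n_components=5, svd_solver='randomized', random_state=42).fit(X)

# SparsePCA: an L1 penalty (alpha) drives many component loadings to zero
spca = SparsePCA(n_components=5, alpha=1.0, random_state=42).fit(X)

print(rpca.components_.shape)
print(spca.components_.shape)
```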
In the following sections, we will look at other unsupervised learning methods that build on some of the ideas of PCA.
Lab10, A4 Necessity of Feature-Scaling for the PCA¶
Feature scaling through standardization (or Z score normalization) can be an important pre-processing step for many machine learning processes. Since many algorithms (such as SVM, K-nearest neighbors, and logistic regression) require the normalization of features, we can analyze the importance of scaling data using the example of Principal Component Analysis (PCA).
In PCA, we are interested in the components that maximize variance. If one feature (e.g., height) varies less than another (e.g., weight) only because different scales are used (meters vs. kilograms), PCA may conclude that the direction of maximum variance corresponds to weight rather than height when these features are not scaled. Yet a change in height of one meter is arguably far more significant than a change in weight of one kilogram, so this assignment is clearly wrong.
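This height/weight argument can be sketched in a few lines on synthetic data (the numbers below are made up purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
height_m = rng.normal(1.70, 0.10, 500)                   # spread ~0.1 (meters)
weight_kg = 60 * height_m - 30 + rng.normal(0, 5, 500)   # spread ~8 (kilograms)
X = np.column_stack([height_m, weight_kg])

pc1_raw = PCA(n_components=1).fit(X).components_[0]
pc1_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]
print(np.abs(pc1_raw))   # weight dominates purely because of its numeric scale
print(np.abs(pc1_std))   # after scaling, both features contribute comparably
```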
To illustrate this, we now go through a PCA by scaling the data with the class StandardScaler from the module sklearn.preprocessing. The results are visualized and compared with the results of unscaled data. We will notice a clearer difference when using standardization. The data set used is the wine data set available from UCI. This data set has continuous features that are heterogeneous due to the different magnitudes of the characteristics they measure (e.g. alcohol content and malic acid).
The transformed data is then used to train a naive Bayes classifier. Significant differences in predictive accuracy can be observed, with the dataset that was scaled before PCA far outperforming the unscaled version.
(a) Import of the used classes¶
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
RANDOM_STATE = 42
FIG_SIZE = (10, 7)
features, target = load_wine(return_X_y=True)
target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2])
(b) Split in training and test dataset¶
# Make a train/test split using 30% test size
X_train, X_test, y_train, y_test = train_test_split(features, target,
test_size=0.30,
random_state=RANDOM_STATE)
(c) Create a pipeline¶
We use a Pipeline to perform a PCA with two principal components and then train a naive Bayes classifier, and look at the accuracy on the test data. We first do this without scaling the features.
# Fit to data and predict using pipelined GNB and PCA.
unscaled_clf = make_pipeline(PCA(n_components=2), GaussianNB())
unscaled_clf.fit(X_train, y_train)
pred_test = unscaled_clf.predict(X_test)
# Show prediction accuracy in unscaled data.
print('\nPrediction accuracy for the normal test dataset with PCA')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))
Prediction accuracy for the normal test dataset with PCA
81.48%
(d) PCA using 4 principal components (without standardization / scaling of the data)¶
unscaled_clf = make_pipeline(PCA(n_components=4), GaussianNB())
unscaled_clf.fit(X_train, y_train)
pred_test = unscaled_clf.predict(X_test)
# Show prediction accuracy in unscaled data.
print('\nPrediction accuracy for the normal test dataset with PCA')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))
Prediction accuracy for the normal test dataset with PCA
98.15%
(f) New pipeline with scaled features¶
The features are now scaled for comparison.
# Fit to data and predict using pipelined scaling, GNB and PCA.
std_clf = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())
std_clf.fit(X_train, y_train)
pred_test_std = std_clf.predict(X_test)
(g) Prediction accuracy for the scaled data (using two principal components)¶
# Show prediction accuracy for the scaled data.
print('\nPrediction accuracy for the standardized test dataset with PCA')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test_std)))
Prediction accuracy for the standardized test dataset with PCA
98.15%
(h) Plotting the principal components¶
Now we extract the principal components, once for the unscaled and once for the scaled case.
# Extract PCA from pipeline
pca = unscaled_clf.named_steps['pca']
X_train_unscaled = pca.transform(X_train)
pca_std = std_clf.named_steps['pca']
# Show first principal components
print('\nPC 1 without scaling:\n', pca.components_[0])
print('\nPC 1 with scaling:\n', pca_std.components_[0])
PC 1 without scaling:
 [ 1.76342917e-03 -8.35544737e-04  1.54623496e-04 -5.31136096e-03
   2.01663336e-02  1.02440667e-03  1.53155502e-03 -1.11663562e-04
   6.31071580e-04  2.32645551e-03  1.53606718e-04  7.43176482e-04
   9.99775716e-01]

PC 1 with scaling:
 [ 0.13443023 -0.25680248 -0.0113463  -0.23405337  0.15840049  0.39194918
   0.41607649 -0.27871336  0.33129255 -0.11383282  0.29726413  0.38054255
   0.27507157]
Looking at the first principal component of the unscaled data, you can see that feature #13 (proline) dominates the direction, simply because its raw values are several orders of magnitude larger than those of the other features. This contrasts with the first principal component of the scaled data, where the loadings are of roughly the same order of magnitude for all features.
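This scale imbalance can be confirmed directly from the raw feature variances of the wine data:

```python
import numpy as np
from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True)
var = X.var(axis=0)
# The last feature (proline) has by far the largest raw variance,
# which is why unscaled PCA aligns its first axis almost entirely with it.
print(var.argmax(), var.max())
```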
# Scale and use PCA on X_train data for visualization.
scaler = std_clf.named_steps['standardscaler']
X_train_std = pca_std.transform(scaler.transform(X_train))
# visualize standardized vs. untouched dataset with PCA performed
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=FIG_SIZE)
for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):
ax1.scatter(X_train_unscaled[y_train == l, 0], X_train_unscaled[y_train == l, 1],
color=c,
label='class %s' % l,
alpha=0.5,
marker=m
)
for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):
ax2.scatter(X_train_std[y_train == l, 0], X_train_std[y_train == l, 1],
color=c,
label='class %s' % l,
alpha=0.5,
marker=m
)
ax1.set_title('Training dataset after PCA')
ax2.set_title('Standardized training dataset after PCA')
for ax in (ax1, ax2):
ax.set_xlabel('1st principal component')
ax.set_ylabel('2nd principal component')
ax.legend(loc='upper right')
ax.grid()
plt.tight_layout()
plt.show()
Lab10, A5 – Dimensionality Reduction¶
Exercise 5: Stochastic Neighbor Embedding (t-SNE) on the MNIST dataset¶
Setup¶
First, let's make sure this notebook works well in both Python 2 and 3, import a few common modules, ensure Matplotlib plots figures inline and prepare a function to save the figures:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals
# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "dim_reduction"
def save_fig(fig_id, tight_layout=True):
path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
print("Saving figure", fig_id)
if tight_layout:
plt.tight_layout()
plt.savefig(path, format='png', dpi=600)
(a) Using t-SNE to reduce the dimensionality to two dimensions¶
Exercise: Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image's target class.
from sklearn.decomposition import PCA
Let's start by loading the MNIST dataset:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
#custom_data_home='C:/temp'
#mnist = fetch_openml('mnist_784', data_home=custom_data_home)
mnist.DESCR
"**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges \n**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown \n**Please cite**: \n\nThe MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples \n\nIt is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field. \n\nWith some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass. If you do this kind of pre-processing, you should report it in your publications. The MNIST database was constructed from NIST's NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this can be found on the fact that SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test among the complete set of samples. 
Therefore it was necessary to build a new database by mixing NIST's datasets. \n\nThe MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. Our test set was composed of 5,000 patterns from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training set contained examples from approximately 250 writers. We made sure that the sets of writers of the training set and test set were disjoint. SD-1 contains 58,527 digit images written by 500 different writers. In contrast to SD-3, where blocks of data from each writer appeared in sequence, the data in SD-1 is scrambled. Writer identities for SD-1 is available and we used this information to unscramble the writers. We then split SD-1 in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each. The new training set was completed with enough examples from SD-3, starting at pattern # 0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with SD-3 examples starting at pattern # 35,000 to make a full set with 60,000 test patterns. Only a subset of 10,000 test images (5,000 from SD-1 and 5,000 from SD-3) is available on this site. The full 60,000 sample training set is available.\n\nDownloaded from openml.org."
Dimensionality reduction on the full 60,000 images takes a very long time, so let's only do this on a random subset of 5,000 images:
np.random.seed(42)
m = 5000
idx = np.random.permutation(60000)[:m]
X = mnist.data.iloc[idx,:].values
y = mnist.target.iloc[idx].values
y[0:5]
['7', '3', '8', '9', '3'] Categories (10, object): ['0', '1', '2', '3', ..., '6', '7', '8', '9']
#we save the data to disk so that we can fetch it fast if we need it using the next cell:
import pandas as pd
mnist.data.to_pickle('mnist_data_784.pkl')
mnist.target.to_pickle('mnist_label_784.pkl');
#read data from pickle file if there is no internet connection
# Xp=pd.read_pickle('mnist_data_784.pkl').values
# yp=pd.read_pickle('mnist_label_784.pkl').values
# m = 5000
# idx = np.random.permutation(60000)[:m]
# X = Xp[idx,:]
# y = yp[idx]
Now let's use t-SNE to reduce dimensionality down to 2D so we can plot the dataset (This can take quite a while...):
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42)
X_reduced = tsne.fit_transform(X)
Now let's use Matplotlib's scatter() function to plot a scatterplot, using a different color for each digit:
plt.figure(figsize=(13,10))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y.astype(int), cmap="jet")
plt.axis('off')
plt.colorbar()
plt.show()
Isn't this just beautiful? :) This plot tells us which numbers are easily distinguishable from the others (e.g., 0s, 6s, and most 8s are rather well separated clusters), and it also tells us which numbers are often hard to distinguish (e.g., 4s and 9s, 5s and 3s, and so on).
(b) Labelling the Clusters¶
Let's focus on digits 2, 3 and 5, which seem to overlap a lot.
plt.figure(figsize=(8,8))
cmap = matplotlib.colormaps["jet"]  # cm.get_cmap is deprecated since Matplotlib 3.7
for digit in (2, 3, 5):
color=np.array(cmap(digit / 9)).reshape(1,4)
plt.scatter(X_reduced[y.astype(int) == digit, 0], X_reduced[y.astype(int) == digit, 1], c=color,alpha=0.5)
#plt.axis('off')
plt.show()
t-SNE instead of PCA¶
Let's see if we can produce a nicer image by running t-SNE on these 3 digits:
y
['7', '3', '8', '9', '3', ..., '0', '4', '7', '4', '9'] Length: 5000 Categories (10, object): ['0', '1', '2', '3', ..., '6', '7', '8', '9']
idx = (y == '2') | (y == '3') | (y == '5')
X_subset = X[idx]
y_subset = y[idx]
tsne_subset = TSNE(n_components=2, random_state=42)
X_subset_reduced = tsne_subset.fit_transform(X_subset)
plt.figure(figsize=(9,9))
for digit in (2, 3, 5):
color=np.array(cmap(digit / 9)).reshape(1,4)
plt.scatter(X_subset_reduced[y_subset.astype(int) == digit, 0],
X_subset_reduced[y_subset.astype(int) == digit, 1], c=color)
plt.axis('off')
plt.show()
Much better, now the clusters have far less overlap. But some 3s are all over the place. Plus, there are two distinct clusters of 2s, and also two distinct clusters of 5s. It would be nice if we could visualize a few digits from each cluster, to understand why this is the case. Let's do that now.
Exercise: Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits.
Let's create a plot_digits() function that will draw a scatterplot (similar to the above scatterplots) plus write colored digits, with a minimum distance guaranteed between these digits. If the digit images are provided, they are plotted instead. This implementation was inspired from one of Scikit-Learn's excellent examples (plot_lle_digits, based on a different digit dataset).
from sklearn.preprocessing import MinMaxScaler
from matplotlib.offsetbox import AnnotationBbox, OffsetImage
def plot_digits(X, y, min_distance=0.05, images=None, figsize=(13, 10)):
# Let's scale the input features so that they range from 0 to 1
X_normalized = MinMaxScaler().fit_transform(X)
# Now we create the list of coordinates of the digits plotted so far.
# We pretend that one is already plotted far away at the start, to
# avoid `if` statements in the loop below
neighbors = np.array([[10., 10.]])
# The rest should be self-explanatory
plt.figure(figsize=figsize)
cmap = matplotlib.colormaps["jet"]  # cm.get_cmap is deprecated since Matplotlib 3.7
digits = np.unique(y)
for digit in digits:
color=np.array(cmap(digit / 9)).reshape(1,4)
plt.scatter(X_normalized[y == digit, 0], X_normalized[y == digit, 1], c=color)
plt.axis("off")
ax = plt.gcf().gca() # get current axes in current figure
for index, image_coord in enumerate(X_normalized):
closest_distance = np.linalg.norm(np.array(neighbors) - image_coord, axis=1).min()
if closest_distance > min_distance:
neighbors = np.r_[neighbors, [image_coord]]
if images is None:
plt.text(image_coord[0], image_coord[1], str(int(y[index])),
color=cmap(y[index] / 9), fontdict={"weight": "bold", "size": 16})
else:
image = images[index].reshape(28, 28)
imagebox = AnnotationBbox(OffsetImage(image, cmap="binary"), image_coord)
ax.add_artist(imagebox)
Let's try it! First let's just write colored digits:
plot_digits(X_reduced, y.astype(int))
Well that's okay, but not that beautiful. Let's try with the digit images:
plot_digits(X_reduced, y.astype(int), images=X, figsize=(35, 25))
plot_digits(X_subset_reduced, y_subset.astype(int), images=X_subset, figsize=(22, 22))
Exercise: Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.
Let's start with PCA. We will also time how long it takes:
from sklearn.decomposition import PCA
import time
t0 = time.time()
X_pca_reduced = PCA(n_components=2, random_state=42).fit_transform(X)
t1 = time.time()
print("PCA took {:.1f}s.".format(t1 - t0))
plot_digits(X_pca_reduced, y.astype(int))
plt.show()
PCA took 0.1s.
Wow, PCA is blazingly fast! But although we do see a few clusters, there's way too much overlap. Let's try LLE:
(c) Using other dimensionality reduction algorithms such as LLE, MDS and t-SNE¶
First we start with LLE, i.e. Locally Linear Embedding, which is part of the sklearn.manifold module.
from sklearn.manifold import LocallyLinearEmbedding
t0 = time.time()
X_lle_reduced = LocallyLinearEmbedding(n_components=2, random_state=42).fit_transform(X)
t1 = time.time()
print("LLE took {:.1f}s.".format(t1 - t0))
plot_digits(X_lle_reduced, y.astype(int))
plt.show()
LLE took 3.4s.
That took a while, and the result does not look too good. Let's see what happens if we apply PCA first, preserving 95% of the variance:
from sklearn.pipeline import Pipeline
pca_lle = Pipeline([
("pca", PCA(n_components=0.95, random_state=42)),
("lle", LocallyLinearEmbedding(n_components=2, random_state=42)),
])
t0 = time.time()
X_pca_lle_reduced = pca_lle.fit_transform(X)
t1 = time.time()
print("PCA+LLE took {:.1f}s.".format(t1 - t0))
plot_digits(X_pca_lle_reduced, y.astype(int))
plt.show()
PCA+LLE took 4.2s.
The result is more or less the same; note, though, that with these timings the PCA preprocessing did not actually speed LLE up (4.2s vs. 3.4s), even if it often does on larger datasets.
Let's try MDS. It would take much too long on all 5,000 instances, so let's just try 2,000 for now:
from sklearn.manifold import MDS
m = 2000
t0 = time.time()
X_mds_reduced = MDS(n_components=2, random_state=42).fit_transform(X[:m])
t1 = time.time()
print("MDS took {:.1f}s (on just 2,000 MNIST images instead of 5,000).".format(t1 - t0))
plot_digits(X_mds_reduced, y[:m].astype(int))
plt.show()
MDS took 185.0s (on just 2,000 MNIST images instead of 5,000).
Meh. This does not look great, all clusters overlap too much. Let's try with PCA first, perhaps it will be faster?
from sklearn.pipeline import Pipeline
pca_mds = Pipeline([
("pca", PCA(n_components=0.95, random_state=42)),
("mds", MDS(n_components=2, random_state=42)),
])
t0 = time.time()
X_pca_mds_reduced = pca_mds.fit_transform(X[:2000])
t1 = time.time()
print("PCA+MDS took {:.1f}s (on 2,000 MNIST images).".format(t1 - t0))
plot_digits(X_pca_mds_reduced, y[:2000].astype(int))
plt.show()
PCA+MDS took 189.1s (on 2,000 MNIST images).
Same result, and no speedup: PCA did not help (or hurt).
Let's try LDA:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
t0 = time.time()
X_lda_reduced = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
t1 = time.time()
print("LDA took {:.1f}s.".format(t1 - t0))
plot_digits(X_lda_reduced, y.astype(int), figsize=(12,12))
plt.show()
LDA took 0.8s.
This one is very fast, and it looks nice at first, until you realize that several clusters overlap severely.
Well, it's pretty clear that t-SNE won this little competition, wouldn't you agree? We did not time it, so let's do that now:
from sklearn.manifold import TSNE
t0 = time.time()
X_tsne_reduced = TSNE(n_components=2, random_state=42).fit_transform(X)
t1 = time.time()
print("t-SNE took {:.1f}s.".format(t1 - t0))
plot_digits(X_tsne_reduced, y.astype(int))
plt.show()
t-SNE took 31.0s.
It is much slower than LLE (roughly nine times, with these timings), but still much faster than MDS, and the result looks great. Let's see if a bit of PCA can speed it up:
pca_tsne = Pipeline([
("pca", PCA(n_components=0.95, random_state=42)),
("tsne", TSNE(n_components=2, random_state=42)),
])
t0 = time.time()
X_pca_tsne_reduced = pca_tsne.fit_transform(X)
t1 = time.time()
print("PCA+t-SNE took {:.1f}s.".format(t1 - t0))
plot_digits(X_pca_tsne_reduced, y.astype(int))
plt.show()
PCA+t-SNE took 33.6s.
With these timings PCA did not actually speed t-SNE up (33.6s vs. 31.0s), but it did not damage the result either. Either way, t-SNE remains the winner!
Lab 11, A6: Feature Engineering using PCA¶
Setup¶
First, let's make sure this notebook works well in both Python 2 and 3, import a few common modules, ensure Matplotlib plots figures inline and prepare a function to save the figures:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals
# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "dim_reduction"
def save_fig(fig_id, tight_layout=True):
path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
print("Saving figure", fig_id)
if tight_layout:
plt.tight_layout()
plt.savefig(path, format='png', dpi=300)
(a) Loading the MNIST Dataset¶
Exercise: Load the MNIST dataset (introduced in chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing).
from sklearn.decomposition import PCA
#from sklearn.datasets import fetch_mldata
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', parser='auto')
X_train = mnist['data'][:60000].values
y_train = mnist['target'][:60000].astype(int).values
X_test = mnist['data'][60000:].values
y_test = mnist['target'][60000:].astype(int).values
X_train
y_train
array([5, 0, 4, ..., 5, 6, 8])
# We save the data to disk so that we can fetch it quickly later using the next cell:
import pandas as pd
#mnist.data.to_pickle('mnist_data_784.pkl')
#mnist.target.to_pickle('mnist_label_784.pkl');
#read data from pickle file if there is no internet connection
X=pd.read_pickle('mnist_data_784.pkl').values
y=pd.read_pickle('mnist_label_784.pkl').values
X_train = X[:60000, :]
y_train = y[:60000].astype(int)
X_test = X[60000:, :]
y_test = y[60000:].astype(int)
(b) Training a Random Forest classifier on the dataset¶
Exercise: Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set.
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(random_state=42)
import time
t0 = time.time()
rnd_clf.fit(X_train, y_train)
t1 = time.time()
print("Training took {:.2f}s".format(t1 - t0))
Training took 74.57s
from sklearn.metrics import accuracy_score
y_pred = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred)
0.9705
(c) Use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%¶
Exercise: Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)
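Passing a float between 0 and 1 as n_components tells PCA to keep the smallest number of components whose cumulative explained variance ratio reaches that threshold. Here is a minimal sketch on synthetic data (the shapes and the 95% threshold are illustrative, not part of the exercise):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# Synthetic data: 3 strong directions embedded in 10 dimensions, plus tiny noise
X_demo = rng.randn(500, 3) @ rng.randn(3, 10) + 0.01 * rng.randn(500, 10)

# A float n_components is interpreted as a target explained variance ratio
pca_demo = PCA(n_components=0.95)
pca_demo.fit(X_demo)

# The chosen number of components and the variance they actually preserve
print(pca_demo.n_components_)
print(pca_demo.explained_variance_ratio_.sum())
```

Since the data has only three strong directions, PCA needs at most three components to reach 95% of the variance.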
Exercise: Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster?
rnd_clf2 = RandomForestClassifier(random_state=42)
t0 = time.time()
rnd_clf2.fit(X_train_reduced, y_train)
t1 = time.time()
print("Training took {:.2f}s".format(t1 - t0))
Training took 320.77s
Oh no! Training is actually more than four times slower now! How can that be? Well, as we saw in this chapter, dimensionality reduction does not always lead to faster training: it depends on the dataset, the model and the training algorithm. See figure 8-6 (the manifold_decision_boundary_plot* plots above). If you try a softmax classifier instead of a random forest classifier, you will find that training time is substantially reduced when using PCA. Actually, we will do this in a second, but first let's check the accuracy of the new random forest classifier.
(d) Evaluate the classifier on the test set: how does it compare to the previous classifier?¶
X_test_reduced = pca.transform(X_test)
y_pred = rnd_clf2.predict(X_test_reduced)
accuracy_score(y_test, y_pred)
0.9481
It is common for performance to drop slightly when reducing dimensionality, because we do lose some useful signal in the process. However, the performance drop is rather severe in this case. So PCA really did not help: it slowed down training and reduced performance. :(
Let's see if it helps when using softmax regression:
from sklearn.linear_model import LogisticRegression
# Note: lbfgs warns that it does not fully converge within max_iter on the unscaled pixel data
log_clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", random_state=42, max_iter=1000)
t0 = time.time()
log_clf.fit(X_train, y_train)
t1 = time.time()
print("Training took {:.2f}s".format(t1 - t0))
Training took 112.15s
y_pred = log_clf.predict(X_test)
accuracy_score(y_test, y_pred)
0.921
Okay, so softmax regression takes much longer to train on this dataset than the random forest classifier, plus it performs worse on the test set. But that's not what we are interested in right now; we want to see how much PCA can help softmax regression. Let's train the softmax regression model using the reduced dataset:
# As above, lbfgs warns that it does not fully converge within max_iter
log_clf2 = LogisticRegression(multi_class="multinomial", solver="lbfgs", random_state=42, max_iter=1000)
t0 = time.time()
log_clf2.fit(X_train_reduced, y_train)
t1 = time.time()
print("Training took {:.2f}s".format(t1 - t0))
Training took 45.11s
Nice! Reducing dimensionality led to a roughly 2.5× speedup. :) Let's check the model's accuracy:
y_pred = log_clf2.predict(X_test_reduced)
accuracy_score(y_test, y_pred)
0.9229
A very slight drop in performance, which might be a reasonable price to pay for a 2.5× speedup, depending on the application.
So there you have it: PCA can give you a formidable speedup... but not always!
Lab 10, A7 Principal Component Analysis for Noise Filtering¶
PCA as Noise Filtering¶
PCA can also be used as a filtering approach for noisy data. The idea is this: any components with variance much larger than the effect of the noise should be relatively unaffected by the noise. So if you reconstruct the data using just the largest subset of principal components, you should be preferentially keeping the signal and throwing out the noise.
Let's see how this looks with the digits data. First we will plot several of the noise-free input digits:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
#digits = load_digits()
#digits.data.shape
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784',parser='auto')
#digits.data = mnist['data'][:2000]
#digits.target = mnist['target'][:2000].astype(int)
mnist.data
| pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | pixel10 | ... | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 | pixel784 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 69996 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 69997 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 69998 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 69999 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
70000 rows × 784 columns
digits = mnist
digits.data = mnist['data'][:2000]
digits.target = mnist['target'][:2000].astype(int)
data = digits.data.values
target = digits.target.values
def plot_digits(data):
    fig, axes = plt.subplots(4, 10, figsize=(10, 4),
                             subplot_kw={'xticks': [], 'yticks': []},
                             gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(28, 28),
                  cmap='binary', interpolation='nearest',
                  clim=(0, 256))
plot_digits(data)
Now let's add some random noise to create a noisy dataset, and re-plot it:
np.random.seed(42)
noisy = np.random.normal(data, 40)
plot_digits(noisy)
It's clear by eye that the images are noisy, and contain spurious pixels. Let's train a PCA on the noisy data, requesting that the projection preserve 50% of the variance:
pca = PCA(0.50).fit(noisy)
pca.n_components_
22
Here 50% of the variance amounts to 22 principal components. Now we compute these components, and then use the inverse of the transform to reconstruct the filtered digits:
components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
plot_digits(filtered)
This signal preserving/noise filtering property makes PCA a very useful feature selection routine—for example, rather than training a classifier on very high-dimensional data, you might instead train the classifier on the lower-dimensional representation, which will automatically serve to filter out random noise in the inputs.
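As a rough sketch of that idea, here is a hypothetical pipeline on the small 8×8 digits set that trains a classifier on the PCA-reduced representation (the 80% variance threshold and the logistic regression classifier are arbitrary choices for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# The small 8x8 digits set keeps this sketch fast
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, random_state=42)

# Train the classifier on the lower-dimensional, noise-filtered representation
clf = Pipeline([
    ("pca", PCA(n_components=0.80)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```

The classifier never sees the raw pixels, only the principal components that carry most of the variance.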
Principal Component Analysis Summary¶
In this section we have discussed the use of principal component analysis for dimensionality reduction, for visualization of high-dimensional data, for noise filtering, and for feature selection within high-dimensional data. Because of the versatility and interpretability of PCA, it has been shown to be effective in a wide variety of contexts and disciplines. Given any high-dimensional dataset, I tend to start with PCA in order to visualize the relationship between points (as we did with the digits), to understand the main variance in the data (as we did with the eigenfaces), and to understand the intrinsic dimensionality (by plotting the explained variance ratio). Certainly PCA is not useful for every high-dimensional dataset, but it offers a straightforward and efficient path to gaining insight into high-dimensional data.
PCA's main weakness is that it tends to be highly affected by outliers in the data.
For this reason, many robust variants of PCA have been developed, many of which act to iteratively discard data points that are poorly described by the initial components.
Scikit-Learn contains a couple of interesting variants on PCA, including randomized PCA and SparsePCA, both in the sklearn.decomposition submodule.
Randomized PCA, which we saw earlier (in recent versions requested via PCA(svd_solver='randomized') rather than a separate RandomizedPCA class), uses a non-deterministic method to quickly approximate the first few principal components in very high-dimensional data, while SparsePCA introduces a regularization term that serves to enforce sparsity of the components.
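A small sketch of both variants, assuming a recent scikit-learn where randomized PCA is selected through PCA's svd_solver argument (the data and hyperparameters here are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.RandomState(0)
X_hi = rng.randn(300, 100)

# Randomized PCA: fast, non-deterministic approximation of the first components
rpca = PCA(n_components=10, svd_solver="randomized", random_state=0)
X_r = rpca.fit_transform(X_hi)

# SparsePCA: an L1 penalty forces each component to use only a few input features
spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
X_s = spca.fit_transform(X_hi)
print(X_r.shape, X_s.shape)
```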
In the following sections, we will look at other unsupervised learning methods that build on some of the ideas of PCA.
Lab11: A2_Elbow Curve and sklearn.cluster.KMeans¶
Based on Python Machine Learning 2nd Edition by Sebastian Raschka, Packt Publishing Ltd. 2017
import os
os.environ['OMP_NUM_THREADS'] = '1'
%matplotlib inline
from IPython.display import Image
import matplotlib.pyplot as plt
Grouping objects by similarity using k-means¶
K-means clustering using scikit-learn¶
Using the following code lines, you can generate synthetic data clusters that can be used for testing clustering algorithms. In this exercise, you will learn how to apply k-means and how to determine the optimum number of clusters using the elbow criterion on the inertia plot.
(a) Scatterplot¶
Generate a distribution of 8 clusters with 250 samples and plot them as a scatterplot. How many clusters do you recognize by eye? Try to change the cluster standard deviation cluster_std until it becomes hard for you to discriminate the 8 different clusters.
from sklearn.datasets import make_blobs
import numpy as np
n_samples = 250
n_features = 8
centers = 8
X, y = make_blobs(n_samples=n_samples,
n_features=n_features,
centers=centers,
cluster_std=1*np.array([0.8,0.9,1.2,1,1.0,0.7,0.9,0.9]),
shuffle=True,
random_state=0)
plt.scatter(X[:, 7], X[:, 4], c=y, marker='o', edgecolor='black', s=50)
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_01.png', dpi=300)
plt.show()
(b) KMeans¶
Import the method KMeans from sklearn.cluster. Instantiate a model km with 8 clusters
(n_clusters=8). Set the maximum number of iterations to max_iter=300 and n_init=10.
Fit the model to the data and predict the cluster label using km.fit_predict(X).
Hint: One way to deal with convergence problems is to choose larger values for tol, which
is a parameter that controls the tolerance with regard to the changes in the within-cluster
sum-squared-error to declare convergence. Try a tolerance of 1e-04.
from sklearn.cluster import KMeans
km = KMeans(n_clusters=8,
init='k-means++',
n_init=10,
max_iter=300,
tol=1e-04,
random_state=0)
y_km = km.fit_predict(X)
print(km.score(X))
print(km.inertia_)
-1590.0542314855682
1590.0542314855682
(c) Display the clustered data¶
Use the function PlotClusters to display the clustered data.
from matplotlib import colors as mcolors
colors = dict(mcolors.BASE_COLORS, **mcolors.CSS4_COLORS)
ColorNames=list(colors.keys())
HSV=colors.values()
def PlotClusters(X, y, km):
    print("%i clusters" % km.n_clusters)
    plt.figure()
    for ClusterNumber in range(km.n_clusters):
        plt.scatter(X[y == ClusterNumber, 7],
                    X[y == ClusterNumber, 4],
                    s=50, c=ColorNames[ClusterNumber+1],
                    marker='s', edgecolor='black',
                    label='cluster {0}'.format(ClusterNumber+1))
    plt.scatter(km.cluster_centers_[:, 7],
                km.cluster_centers_[:, 4],
                s=250, marker='*',
                c='red', edgecolor='black',
                label='centroids')
    plt.legend(scatterpoints=1)
    plt.grid()
    plt.tight_layout()
    #plt.savefig('images/11_02.png', dpi=300)
    plt.show()
(d) Variation of the number k of clusters¶
Vary the number of clusters n_clusters=8 in your KMeans clustering algorithm from 4 to
8 and display each time the result using the function PlotClusters.
for n_clusters in range(2, 9):
    km = KMeans(n_clusters=n_clusters,
                init='random',
                n_init=10,
                max_iter=300,
                tol=1e-04,
                random_state=0)
    y_km = km.fit_predict(X)
    PlotClusters(X, y_km, km)
2 clusters
3 clusters
4 clusters
5 clusters
6 clusters
7 clusters
8 clusters
# Important to check which features are used for the x and y coordinates when plotting!
# If points appear closer to the centroid of a different cluster than to their own centroid,
# this can be because the algorithm is stuck in a local minimum; the global minimum is not always found.
# A change in the random seed for the initial centroid placement can lead to different minima and better results.
Using the elbow method to find the optimal number of clusters¶
(e) Elbow Method¶
Vary in a for loop the number of clusters from n_clusters=8 to n_clusters=15 and
cluster the data each time using the km.fit_predict method. Read out the inertia
km.inertia_ and store it in a list called distortions as function of the number of clusters
using the append method. Display the inertia as function of the number of clusters
and determine the optimum number of clusters from the elbow curve.
print('Distortion: %.2f' % km.inertia_)
Distortion: 5524.97
km.inertia_
5524.96603004677
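As a quick sanity check on what inertia means, the value reported by km.inertia_ should match the sum of squared distances from each sample to its nearest (assigned) centroid, computed by hand. A self-contained sketch on fresh blob data (the data here is separate from the exercise's X):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_chk, _ = make_blobs(n_samples=200, centers=4, random_state=0)
km_chk = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_chk)

# Squared distance of every sample to every centroid, then take the nearest
d2 = ((X_chk[:, None, :] - km_chk.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual_inertia = d2.min(axis=1).sum()
print(manual_inertia, km_chk.inertia_)
```

The two values agree, which is why the inertia plotted below can serve as the "distortion" for the elbow curve.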
import numpy as np
distortions = []
ScoreList = []
InertiaList = []
maxNumberOfClusters=15
for k in range(1, maxNumberOfClusters):
    km = KMeans(n_clusters=k,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
    # Heuristic BIC-style score: scaled negative log-score plus a complexity penalty
    BIC = -km.score(X)/np.sqrt(n_samples) + np.log(n_samples)*n_features*k
    ScoreList.append(BIC)
    InertiaList.append(-km.score(X)/np.sqrt(n_samples))
#plt.semilogy(range(1, maxNumberOfClusters), distortions, marker='o')
plt.plot(range(1, maxNumberOfClusters), ScoreList, marker='^',label='BIC')
plt.plot(range(1, maxNumberOfClusters), InertiaList, marker='o',label='inertia')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.legend()
plt.tight_layout()
plt.grid(True)
#plt.savefig('images/11_03.png', dpi=300)
plt.show()
(f) kMeans++¶
Without an explicit definition, a random seed is used to place the initial centroids, which can sometimes result in bad clusterings or slow convergence. Another strategy is to place the initial centroids far away from each other via the k-means++ algorithm, which leads to better and more consistent results than classic k-means. This can be selected in sklearn.cluster.KMeans by setting init='k-means++'.
- D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007). http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
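The seeding idea from the paper can be sketched in a few lines: after the first centre is picked uniformly at random, each subsequent centre is drawn with probability proportional to its squared distance to the nearest centre chosen so far. This is a toy illustration, not scikit-learn's actual implementation:

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    # First centre: uniformly at random from the data
    centers = [X[rng.randint(len(X))]]
    for _ in range(k - 1):
        # Squared distance of each point to its nearest existing centre
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Next centre: sampled with probability proportional to that distance
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.RandomState(0)
# Two well-separated blobs: the second centre almost always lands in the far blob
X_seed = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [8, 8]])
centers = kmeanspp_init(X_seed, 2, rng)
print(centers)
```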
for n_clusters in range(4, 9):
    km = KMeans(n_clusters=n_clusters,
                init='k-means++',
                n_init=10,
                max_iter=300,
                tol=1e-04,
                random_state=0)
    y_km = km.fit_predict(X)
    PlotClusters(X, y_km, km)  # plot the predicted labels, not the ground-truth y
4 clusters
5 clusters
6 clusters
7 clusters
8 clusters
Bonus¶
Quantifying the quality of clustering via silhouette plots¶
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_samples
km = KMeans(n_clusters=4,
init='k-means++',
n_init=10,
max_iter=300,
tol=1e-04,
random_state=0)
y_km = km.fit_predict(X)
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0,
             edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
#plt.savefig('images/11_04.png', dpi=300)
plt.show()
print(silhouette_avg)
0.5136457374680606
Comparison to "bad" clustering:
km = KMeans(n_clusters=8,
init='k-means++',
n_init=10,
max_iter=300,
tol=1e-04,
random_state=0)
y_km = km.fit_predict(X)
PlotClusters(X,y_km, km)
8 clusters
import numpy as np
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0,
             edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
#plt.savefig('images/11_06.png', dpi=300)
plt.show()
Organizing clusters as a hierarchical tree¶
Grouping clusters in bottom-up fashion¶
Image(filename='./images/11_05.png', width=400)
import pandas as pd
import numpy as np
np.random.seed(123)
variables = ['X', 'Y', 'Z']
labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']
X = np.random.random_sample([5, 3])*10
df = pd.DataFrame(X, columns=variables, index=labels)
df
| X | Y | Z | |
|---|---|---|---|
| ID_0 | 6.964692 | 2.861393 | 2.268515 |
| ID_1 | 5.513148 | 7.194690 | 4.231065 |
| ID_2 | 9.807642 | 6.848297 | 4.809319 |
| ID_3 | 3.921175 | 3.431780 | 7.290497 |
| ID_4 | 4.385722 | 0.596779 | 3.980443 |
Performing hierarchical clustering on a distance matrix¶
from scipy.spatial.distance import pdist, squareform
row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')),
columns=labels,
index=labels)
row_dist
| ID_0 | ID_1 | ID_2 | ID_3 | ID_4 | |
|---|---|---|---|---|---|
| ID_0 | 0.000000 | 4.973534 | 5.516653 | 5.899885 | 3.835396 |
| ID_1 | 4.973534 | 0.000000 | 4.347073 | 5.104311 | 6.698233 |
| ID_2 | 5.516653 | 4.347073 | 0.000000 | 7.244262 | 8.316594 |
| ID_3 | 5.899885 | 5.104311 | 7.244262 | 0.000000 | 4.382864 |
| ID_4 | 3.835396 | 6.698233 | 8.316594 | 4.382864 | 0.000000 |
We can either pass a condensed distance matrix (upper triangular) from the pdist function, or we can pass the "original" data array and define the metric='euclidean' argument in linkage. However, we should not pass the squareform distance matrix, which would yield different distance values although the overall clustering could be the same.
# 1. incorrect approach: Squareform distance matrix
from scipy.cluster.hierarchy import linkage
row_clusters = linkage(row_dist, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
columns=['row label 1', 'row label 2',
'distance', 'no. of items in clust.'],
index=['cluster %d' % (i + 1)
for i in range(row_clusters.shape[0])])
ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
| row label 1 | row label 2 | distance | no. of items in clust. | |
|---|---|---|---|---|
| cluster 1 | 0.0 | 4.0 | 6.521973 | 2.0 |
| cluster 2 | 1.0 | 2.0 | 6.729603 | 2.0 |
| cluster 3 | 3.0 | 5.0 | 8.539247 | 3.0 |
| cluster 4 | 6.0 | 7.0 | 12.444824 | 5.0 |
# 2. correct approach: Condensed distance matrix
row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
pd.DataFrame(row_clusters,
columns=['row label 1', 'row label 2',
'distance', 'no. of items in clust.'],
index=['cluster %d' % (i + 1)
for i in range(row_clusters.shape[0])])
| row label 1 | row label 2 | distance | no. of items in clust. | |
|---|---|---|---|---|
| cluster 1 | 0.0 | 4.0 | 3.835396 | 2.0 |
| cluster 2 | 1.0 | 2.0 | 4.347073 | 2.0 |
| cluster 3 | 3.0 | 5.0 | 5.899885 | 3.0 |
| cluster 4 | 6.0 | 7.0 | 8.316594 | 5.0 |
# 3. correct approach: Input sample matrix
row_clusters = linkage(df.values, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
columns=['row label 1', 'row label 2',
'distance', 'no. of items in clust.'],
index=['cluster %d' % (i + 1)
for i in range(row_clusters.shape[0])])
| row label 1 | row label 2 | distance | no. of items in clust. | |
|---|---|---|---|---|
| cluster 1 | 0.0 | 4.0 | 3.835396 | 2.0 |
| cluster 2 | 1.0 | 2.0 | 4.347073 | 2.0 |
| cluster 3 | 3.0 | 5.0 | 5.899885 | 3.0 |
| cluster 4 | 6.0 | 7.0 | 8.316594 | 5.0 |
from scipy.cluster.hierarchy import dendrogram
# make dendrogram black (part 1/2)
# from scipy.cluster.hierarchy import set_link_color_palette
# set_link_color_palette(['black'])
row_dendr = dendrogram(row_clusters,
labels=labels,
# make dendrogram black (part 2/2)
# color_threshold=np.inf
)
plt.tight_layout()
plt.ylabel('Euclidean distance')
#plt.savefig('images/11_11.png', dpi=300,
# bbox_inches='tight')
plt.show()
Attaching dendrograms to a heat map¶
# plot row dendrogram
fig = plt.figure(figsize=(8, 8), facecolor='white')
axd = fig.add_axes([0.09, 0.1, 0.2, 0.6])
# note: for matplotlib < v1.5.1, please use orientation='right'
row_dendr = dendrogram(row_clusters, orientation='left')
# reorder data with respect to clustering
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]
axd.set_xticks([])
axd.set_yticks([])
# remove axes spines from dendrogram
for i in axd.spines.values():
i.set_visible(False)
# plot heatmap
axm = fig.add_axes([0.23, 0.1, 0.6, 0.6]) # x-pos, y-pos, width, height
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)
axm.set_xticklabels([''] + list(df_rowclust.columns))
axm.set_yticklabels([''] + list(df_rowclust.index))
#plt.savefig('images/11_12.png', dpi=300)
plt.show()
Applying agglomerative clustering via scikit-learn¶
from sklearn.cluster import AgglomerativeClustering
ac = AgglomerativeClustering(n_clusters=3,
                             metric='euclidean',
                             linkage='complete')
labels = ac.fit_predict(X)
print('Cluster labels: %s' % labels)
Cluster labels: [1 0 0 2 1]
ac = AgglomerativeClustering(n_clusters=2,
                             metric='euclidean',
                             linkage='complete')
labels = ac.fit_predict(X)
print('Cluster labels: %s' % labels)
Cluster labels: [0 1 1 0 0]
Locating regions of high density via DBSCAN¶
Image(filename='images/11_13.png', width=500)
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
plt.scatter(X[:, 0], X[:, 1])
plt.tight_layout()
#plt.savefig('images/11_14.png', dpi=300)
plt.show()
K-means and hierarchical clustering:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
km = KMeans(n_clusters=2, n_init=10, random_state=0)
y_km = km.fit_predict(X)
ax1.scatter(X[y_km == 0, 0], X[y_km == 0, 1],
edgecolor='black',
c='lightblue', marker='o', s=40, label='cluster 1')
ax1.scatter(X[y_km == 1, 0], X[y_km == 1, 1],
edgecolor='black',
c='red', marker='s', s=40, label='cluster 2')
ax1.set_title('K-means clustering')
ac = AgglomerativeClustering(n_clusters=2,
                             metric='euclidean',
                             linkage='complete')
y_ac = ac.fit_predict(X)
ax2.scatter(X[y_ac == 0, 0], X[y_ac == 0, 1], c='lightblue',
edgecolor='black',
marker='o', s=40, label='cluster 1')
ax2.scatter(X[y_ac == 1, 0], X[y_ac == 1, 1], c='red',
edgecolor='black',
marker='s', s=40, label='cluster 2')
ax2.set_title('Agglomerative clustering')
plt.legend()
plt.tight_layout()
# plt.savefig('images/11_15.png', dpi=300)
plt.show()
Density-based clustering:
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.2, min_samples=5, metric='euclidean')
y_db = db.fit_predict(X)
plt.scatter(X[y_db == 0, 0], X[y_db == 0, 1],
c='lightblue', marker='o', s=40,
edgecolor='black',
label='cluster 1')
plt.scatter(X[y_db == 1, 0], X[y_db == 1, 1],
c='red', marker='s', s=40,
edgecolor='black',
label='cluster 2')
plt.legend()
plt.tight_layout()
#plt.savefig('images/11_16.png', dpi=300)
plt.show()
y_db
array([ 0, 1, 1, -1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1,
0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0,
1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1,
0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1,
0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, -1, 1,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,
1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0,
0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
0, 0, 0, 0, 1, 1, -1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1], dtype=int64)
Lab 11: A3 k-Means, Gaussian Mixture Models and the EM algorithm¶
import warnings
warnings.filterwarnings("ignore")
from IPython.display import display, HTML
import time
import pandas as pd
#import pandas_datareader.data as web
import numpy as np
import scipy.stats as scs
from scipy.stats import multivariate_normal as mvn
import sklearn.mixture as mix
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
To gain an understanding of mixture models, we have to start at the beginning with the expectation-maximization algorithm and its application¶
First a little history on the EM-algorithm¶
Reference: 4
Dempster, Laird & Rubin (1977) unified previously unrelated (and often overlooked) E-M works under "The EM Algorithm". Note the gaps between the foundational authors:
- Newcomb (1887)
- McKendrick (1926) [+39 years]
- Hartley (1958) [+32 years]
- Baum et al. (1970) [+12 years]
- Dempster et al. (1977) [+7 years]
EM Algorithm developed over 90 years¶
EM provides general framework for solving problems¶
Examples include: - Filling in missing data from a sample set - Discovering values of latent variables - Estimating parameters of HMMs - Estimating parameters of finite mixtures [models] - Unsupervised learning of clusters - etc...
(a) How does the EM algorithm work?¶
EM is an iterative process that begins with a "naive" or random initialization and then alternates between the expectation and maximization steps until the algorithm reaches convergence.
To describe this in words, imagine we have a simple data set consisting of class heights, with groups separated by gender.
# import class heights
f = 'https://raw.githubusercontent.com/BlackArbsCEO/Mixture_Models/K-Means%2C-E-M%2C-Mixture-Models/Class_heights.csv'
#data = pd.read_csv(f)
#data.to_csv('Class_heights.csv')
# data.info()
data=pd.read_csv('Class_heights.csv',index_col=0)
height = data['Height (in)']
data
| Gender | Height (in) | |
|---|---|---|
| 0 | Male | 72 |
| 1 | Male | 72 |
| 2 | Female | 63 |
| 3 | Female | 62 |
| 4 | Female | 62 |
| 5 | Male | 73 |
| 6 | Female | 64 |
| 7 | Female | 63 |
| 8 | Female | 67 |
| 9 | Male | 71 |
| 10 | Male | 72 |
| 11 | Female | 63 |
| 12 | Male | 71 |
| 13 | Female | 67 |
| 14 | Female | 62 |
| 15 | Female | 63 |
| 16 | Male | 66 |
| 17 | Female | 60 |
| 18 | Female | 68 |
| 19 | Female | 65 |
| 20 | Female | 64 |
Now imagine that we did not have the convenient gender labels associated with each data point. How could we estimate the two group means?
First let's set up our problem.
In this example we hypothesize that these height data points are drawn from two distributions with two means - < $\mu_1$, $\mu_2$ >.
The heights are the observed $x$ values.
The hidden variables, which EM is going to estimate, can be thought of in the following way. Each $x$ value has 2 associated $z$ values. These $z$ values < $z_1$, $z_2$ > represent the distribution (or class or cluster) that the data point is drawn from.
Understanding the range of values the $z$ values can take is important.
In k-means, the two $z$'s can only take the values of 0 or 1. If the $x$ value came from the first distribution (cluster), then $z_1$=1 and $z_2$=0 and vice versa. This is called hard clustering.
In Gaussian Mixture Models, the $z$'s can take on any value between 0 and 1 because the $x$ values are considered to be drawn probabilistically from one of the 2 distributions. For example, the $z$ values can be $z_1$=0.85 and $z_2$=0.15, which represents a strong probability that the $x$ value came from distribution 1 and a smaller probability that it came from distribution 2. This is called soft or fuzzy clustering.
For this example, we will assume the x values are drawn from Gaussian distributions.
To start the algorithm, we choose two random means.
From there we repeat the following until convergence.
The expectation step:¶
We calculate the expected values $E(z_{ij})$, which is the probability that $x_i$ was drawn from the $jth$ distribution.
$$E(z_{ij}) = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^2 p(x = x_i \mid \mu = \mu_n)} = \frac{ e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2} } { \sum_{n=1}^2 e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2} }$$
The formula simply states that the expected value of $z_{ij}$ is the probability of $x_i$ given $\mu_j$, divided by the sum of the probabilities of $x_i$ under each $\mu_n$.
The maximization step:¶
After calculating all $E(z_{ij})$ values we can calculate (update) new $\mu$ values.
$$ \mu_j = \frac {\sum_{i=1}^m E(z_{ij})\,x_i} {\sum_{i=1}^m E(z_{ij})}$$
This formula, a responsibility-weighted mean of the data, generates the maximum likelihood estimate of $\mu_j$.
By repeating the E-step and M-step we are guaranteed to find a local maximum giving us a maximum likelihood estimation of our hypothesis.
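The two steps can be sketched in one dimension for the height data. The following is a minimal illustration, not the full implementation used later: it assumes hypothetical starting means of 64 and 70, a fixed known $\sigma = 3$, and equal mixing weights.

```python
import numpy as np

# the 21 observed heights from the table above
heights = np.array([72, 72, 63, 62, 62, 73, 64, 63, 67, 71,
                    72, 63, 71, 67, 62, 63, 66, 60, 68, 65, 64], dtype=float)

mu = np.array([64.0, 70.0])   # hypothetical starting means
sigma = 3.0                   # fixed, known std for this sketch

for _ in range(50):
    # E-step: responsibility of each distribution for each point
    dens = np.exp(-0.5 * ((heights[:, None] - mu[None, :]) / sigma) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means
    new_mu = (resp * heights[:, None]).sum(axis=0) / resp.sum(axis=0)
    if np.allclose(new_mu, mu):
        break
    mu = new_mu

print(mu)  # should land near the female/male group means (roughly 64 and 71)
```

Because the starting means already bracket the two groups, the iteration settles quickly; a different initialization can converge to a different local maximum.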
What are Maximum Likelihood Estimates (MLE)¶
1. Parameters describe characteristics (attributes) of a population. These parameter values are estimated from samples collected from that population.
2. A MLE is a parameter estimate that is most consistent with the sampled data. By definition it maximizes the likelihood function. One way to accomplish this is to take the first derivative of the likelihood function with respect to the parameter $\theta$, set it equal to 0, and solve for $\theta$. This value maximizes the likelihood function and is the MLE.
A quick example of a maximum likelihood estimate¶
You flip a coin 10 times and observe the following sequence (H, T, T, H, T, T, T, T, H, T)¶
What's the MLE of the probability of heads, given 3 heads in 10 trials?¶
simple answer:¶
The frequentist MLE is (# of successes) / (# of trials) or 3/10
solving first derivative of binomial distribution answer:¶
\begin{align} \mathcal L(\theta) & = {10 \choose 3}\theta^3(1-\theta)^7 \\[1ex] \log\mathcal L(\theta) & = \log{10 \choose 3} + 3\log\theta + 7\log(1 - \theta) \\[1ex] \frac{d\log\mathcal L(\theta)}{d\theta} & = \frac 3\theta - \frac{7}{1-\theta} = 0 \\[1ex] \frac 3\theta & = \frac{7}{1 - \theta} \Rightarrow 3(1-\theta) = 7\theta \Rightarrow \theta = \frac{3}{10} \end{align}
That's the MLE! This is the estimate that is most consistent with the observed data¶
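As a quick numerical sanity check (an aside, not part of the original derivation), we can evaluate the binomial log-likelihood on a grid of $\theta$ values and confirm the maximum sits at 3/10:

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 981)
# log-likelihood of 3 heads in 10 flips, up to the constant log C(10, 3)
log_lik = 3 * np.log(theta) + 7 * np.log(1 - theta)
theta_hat = theta[np.argmax(log_lik)]
print(theta_hat)  # ~0.3, matching the closed-form MLE
```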
Back to our height example. Using the generalized Gaussian mixture model code sourced from Duke's computational statistics we can visualize this process.
# Code sourced from:
# http://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html
import scipy.stats as scs
from scipy.stats import multivariate_normal as mvn
def em_gmm_orig(xs, pis, mus, sigmas, tol=0.01, max_iter=100):
n, p = xs.shape
k = len(pis)
ll_old = 0
for i in range(max_iter):
print('\nIteration: ', i)
print()
exp_A = []
exp_B = []
ll_new = 0
# E-step
ws = np.zeros((k, n))
for j in range(len(mus)):
for i in range(n):
ws[j, i] = pis[j] * mvn(mus[j], sigmas[j]).pdf(xs[i])
ws /= ws.sum(0)
# M-step
pis = np.zeros(k)
for j in range(len(mus)):
for i in range(n):
pis[j] += ws[j, i]
pis /= n
mus = np.zeros((k, p))
for j in range(k):
for i in range(n):
mus[j] += ws[j, i] * xs[i]
mus[j] /= ws[j, :].sum()
sigmas = np.zeros((k, p, p))
for j in range(k):
for i in range(n):
ys = np.reshape(xs[i]- mus[j], (2,1))
sigmas[j] += ws[j, i] * np.dot(ys, ys.T)
sigmas[j] /= ws[j,:].sum()
new_mus = (np.diag(mus)[0], np.diag(mus)[1])
new_sigs = (np.unique(np.diag(sigmas[0]))[0], np.unique(np.diag(sigmas[1]))[0])
df = (pd.DataFrame(index=[1, 2]).assign(mus = new_mus).assign(sigs = new_sigs))
xx = np.linspace(0, 100, 100)
yy = scs.multivariate_normal.pdf(xx, mean=new_mus[0], cov=new_sigs[0])
colors = sns.color_palette('Dark2', 3)
fig, ax = plt.subplots(figsize=(9, 7))
ax.set_ylim(-0.001, np.max(yy))
ax.plot(xx, yy, color=colors[1])
ax.axvline(new_mus[0], ymin=0., color=colors[1])
ax.fill_between(xx, 0, yy, alpha=0.5, color=colors[1])
lo, hi = ax.get_ylim()
ax.annotate(rf'$\mu_1$: {new_mus[0]:3.2f}',
fontsize=12, fontweight='demi',
xy=(new_mus[0], (hi-lo) / 2),
xycoords='data', xytext=(80, (hi-lo) / 2),
arrowprops=dict(facecolor='black', connectionstyle="arc3,rad=0.2",shrink=0.05))
ax.fill_between(xx, 0, yy, alpha=0.5, color=colors[2])
yy2 = scs.multivariate_normal.pdf(xx, mean=new_mus[1], cov=new_sigs[1])
ax.plot(xx, yy2, color=colors[2])
ax.axvline(new_mus[1], ymin=0., color=colors[2])
lo, hi = ax.get_ylim()
ax.annotate(rf'$\mu_2$: {new_mus[1]:3.2f}',
fontsize=12, fontweight='demi',
xy=(new_mus[1], (hi-lo) / 2), xycoords='data', xytext=(25, (hi-lo) / 2),
arrowprops=dict(facecolor='black', connectionstyle="arc3,rad=0.2",shrink=0.05))
ax.fill_between(xx, 0, yy2, alpha=0.5, color=colors[2])
dot_kwds = dict(markerfacecolor='white', markeredgecolor='black', markeredgewidth=1, markersize=10)
ax.plot(height, len(height)*[0], 'o', **dot_kwds)
ax.set_ylim(-0.001, np.max(yy2))
print(df.T)
# update complete log likelihood
ll_new = 0.0
for i in range(n):
s = 0
for j in range(k):
s += pis[j] * mvn(mus[j], sigmas[j]).pdf(xs[i])
ll_new += np.log(s)
print(f'log_likelihood: {ll_new:3.4f}')
if np.abs(ll_new - ll_old) < tol:
break
ll_old = ll_new
return ll_new, pis, mus, sigmas
height = data['Height (in)']
n = len(height)
# Ground truthish
_mus = np.array([[0, data.groupby('Gender').mean().iat[0, 0]],
[data.groupby('Gender').mean().iat[1, 0], 0]])
_sigmas = np.array([[[5, 0], [0, 5]],
[[5, 0],[0, 5]]])
_pis = np.array([0.5, 0.5]) # priors
# initial random guesses for parameters
np.random.seed(0)
pis = np.random.random(2)
pis /= pis.sum()
mus = np.random.random((2,2))
sigmas = np.array([np.eye(2)] * 2) * height.std()
# generate our noisy x values
xs = np.concatenate([np.random.multivariate_normal(mu, sigma, int(pi*n))
for pi, mu, sigma in zip(_pis, _mus, _sigmas)])
ll, pis, mus, sigmas = em_gmm_orig(xs, pis, mus, sigmas)
# In the plots below, the white dots represent the observed heights.
Iteration: 0
1 2
mus 61.362928 59.659685
sigs 469.240750 244.382352
log_likelihood: -141.8092
Iteration: 1
1 2
mus 68.73773 63.620554
sigs 109.85442 7.228183
log_likelihood: -118.0520
Iteration: 2
1 2
mus 70.569842 63.688825
sigs 4.424452 3.139277
log_likelihood: -100.2591
Iteration: 3
1 2
mus 70.569842 63.688825
sigs 4.424427 3.139278
log_likelihood: -100.2591
Notice how the algorithm was able to estimate the true means starting from random guesses for the parameters.¶
Now that we have a grasp of the algorithm we can examine K-Means as a form of EM¶
K-Means is an unsupervised learning algorithm used for clustering multidimensional data sets.
The basic form of K-Means makes two assumptions
1. Each data point is closer to its own cluster center than the other cluster centers
2. A cluster center is the arithmetic mean of all the points that belong to the cluster.
The expectation step is done by calculating the distance from every data point to every cluster center and assigning each point to its closest center (mean)
The maximization step then recomputes each center as the arithmetic mean of the data points previously assigned to that cluster
The following sections borrow heavily from Jake Vanderplas' Python Data Science Handbook¶
# Let's define some demo variables and make some blobs
# demo variables
k = 4
n_draws = 500
sigma = .7
random_state = 0
dot_size = 50
cmap = 'viridis'
# make blobs
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples = n_draws,
centers = k,
cluster_std = sigma,
random_state = random_state)
fig, ax = plt.subplots(figsize=(9,7))
ax.scatter(X[:, 0], X[:, 1], s=dot_size)
plt.title('k-means make blobs', fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'k-means make blobs')
# sample implementation
# code sourced from:
# http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.11-K-Means.ipynb
from sklearn.metrics import pairwise_distances_argmin
def find_clusters(X, n_clusters, rseed=2):
# 1. Random initialization (choose random clusters)
rng = np.random.RandomState(rseed)
i = rng.permutation(X.shape[0])[:n_clusters]
centers = X[i]
while True:
# 2a. Assign labels based on closest center
labels = pairwise_distances_argmin(X, centers)
# 2b. Find new centers from means of points
new_centers = np.array([X[labels == i].mean(0)
for i in range(n_clusters)])
# 2c. Check for convergence
if np.all(centers == new_centers):
break
centers = new_centers
return centers, labels
# now let's compare this to the sklearn's KMeans() algorithm
# fit k-means to blobs
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
# visualize prediction
fig, ax = plt.subplots(figsize=(9,7))
ax.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=dot_size, cmap=cmap)
# get centers for plot
centers = kmeans.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.75)
plt.title('sklearn k-means', fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'sklearn k-means')
# let's test the implementation
centers, labels = find_clusters(X, k)
fig, ax = plt.subplots(figsize=(9,7))
ax.scatter(X[:, 0], X[:, 1], c=labels, s=dot_size, cmap=cmap)
plt.title('find_clusters() k-means func', fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'find_clusters() k-means func')
To build our intuition of this process, play with the following interactive code from Jake Vanderplas in a Jupyter (IPython) notebook¶
# code sourced from:
# http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/06.00-Figure-Code.ipynb#Covariance-Type
from ipywidgets import interact
def plot_kmeans_interactive(min_clusters=1, max_clusters=6):
X, y = make_blobs(n_samples=300, centers=4,
random_state=0, cluster_std=0.60)
def plot_points(X, labels, n_clusters):
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis',
vmin=0, vmax=n_clusters - 1);
def plot_centers(centers):
plt.scatter(centers[:, 0], centers[:, 1], marker='o',
c=np.arange(centers.shape[0]),
s=200, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], marker='o',
c='black', s=50)
def _kmeans_step(frame=0, n_clusters=4):
rng = np.random.RandomState(2)
labels = np.zeros(X.shape[0])
centers = rng.randn(n_clusters, 2)
nsteps = frame // 3
for i in range(nsteps + 1):
old_centers = centers
if i < nsteps or frame % 3 > 0:
labels = pairwise_distances_argmin(X, centers)
if i < nsteps or frame % 3 > 1:
centers = np.array([X[labels == j].mean(0)
for j in range(n_clusters)])
nans = np.isnan(centers)
centers[nans] = old_centers[nans]
# plot the data and cluster centers
plot_points(X, labels, n_clusters)
plot_centers(old_centers)
# plot new centers if third frame
if frame % 3 == 2:
for i in range(n_clusters):
plt.annotate('', centers[i], old_centers[i],
arrowprops=dict(arrowstyle='->', linewidth=1))
plot_centers(centers)
plt.xlim(-4, 4)
plt.ylim(-2, 10)
if frame % 3 == 1:
plt.text(3.8, 9.5, "1. Reassign points to nearest centroid",
ha='right', va='top', size=14)
elif frame % 3 == 2:
plt.text(3.8, 9.5, "2. Update centroids to cluster means",
ha='right', va='top', size=14)
return interact(_kmeans_step, frame=[0, 10, 20, 30, 40, 50, 300],
n_clusters=np.arange(min_clusters, max_clusters+1))
plot_kmeans_interactive(min_clusters=1, max_clusters=6)
interactive(children=(Dropdown(description='frame', options=(0, 10, 20, 30, 40, 50, 300), value=0), Dropdown(d…
<function __main__.plot_kmeans_interactive.<locals>._kmeans_step(frame=0, n_clusters=4)>
Now we are ready to explore some of the nuances/issues of implementing K-Means as an expectation maximization algorithm¶
the globally optimal result is not guaranteed¶
- EM is guaranteed to improve the result in each iteration, but there is no guarantee that it will find the global optimum. See the following example, where we initialize the algorithm with a different seed.
practical solution:¶
- Run the algorithm w/ multiple random initializations
- This is done by default in sklearn
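In scikit-learn this is controlled by the `n_init` parameter of `KMeans`: the estimator runs that many random initializations and keeps the fit with the lowest inertia (within-cluster sum of squares). A small sketch; `X_demo` is generated here with the same blob parameters as the `X` used above so the cell is self-contained:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=0)

# one initialization vs. the best of ten restarts
single = KMeans(n_clusters=4, n_init=1, random_state=11).fit(X_demo)
multi = KMeans(n_clusters=4, n_init=10, random_state=11).fit(X_demo)

# keeping the best of several restarts can only lower (or match) the inertia
print(single.inertia_, multi.inertia_)
```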
centers, labels = find_clusters(X, k, rseed=11)
fig, ax = plt.subplots(figsize=(9,7))
ax.set_title('sub-optimal clustering', fontsize=18, fontweight='demi')
ax.scatter(X[:, 0], X[:, 1], c=labels, s=dot_size, cmap=cmap)
<matplotlib.collections.PathCollection at 0x29931091c40>
number of means (clusters) have to be selected beforehand¶
- k-means cannot learn the optimal number of clusters from the data. If we ask for six clusters it will find six clusters, which may or may not be meaningful.
practical solution:¶
- use a more complex clustering algorithm like Gaussian Mixture Models, or one that can choose a suitable number of clusters (DBSCAN, mean-shift, affinity propagation)
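Another common heuristic, added here as an aside, is to scan candidate values of k and score each clustering with the silhouette coefficient, then keep the best-scoring k. A sketch on blob data generated with the same parameters as the `X` above:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=0)

# fit k-means for several k and record the silhouette score of each labeling
scores = {}
for k_try in range(2, 9):
    labels = KMeans(n_clusters=k_try, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k_try] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # for four well-separated blobs this should recover k=4
```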
labels6 = KMeans(6, random_state=random_state).fit_predict(X)
fig, ax = plt.subplots(figsize=(9,7))
ax.set_title('too many clusters', fontsize=18, fontweight='demi')
ax.scatter(X[:, 0], X[:, 1], c=labels6, s=dot_size, cmap=cmap)
<matplotlib.collections.PathCollection at 0x299311408b0>
from sklearn.datasets import make_moons
X_mn, y_mn = make_moons(500, noise=.07, random_state=random_state)
labelsM = KMeans(2, random_state=random_state).fit_predict(X_mn)
fig, ax = plt.subplots(figsize=(9,7))
ax.set_title('linear separation not possible', fontsize=18, fontweight='demi')
ax.scatter(X_mn[:, 0], X_mn[:, 1], c=labelsM, s=dot_size, cmap=cmap)
<matplotlib.collections.PathCollection at 0x299316f0c10>
from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
assign_labels='kmeans')
labelsS = model.fit_predict(X_mn)
fig, ax = plt.subplots(figsize=(9,7))
ax.set_title('kernel transform to higher dimension\nlinear separation is possible', fontsize=18, fontweight='demi')
plt.scatter(X_mn[:, 0], X_mn[:, 1], c=labelsS, s=dot_size, cmap=cmap)
<matplotlib.collections.PathCollection at 0x29931765f40>
K-Means is known as a hard clustering algorithm because clusters are not allowed to overlap.¶
___"One way to think about the k-means model is that it places a circle (or, in higher dimensions, a hyper-sphere) at the center of each cluster, with a radius defined by the most distant point in the cluster. This radius acts as a hard cutoff for cluster assignment within the training set: any point outside this circle is not considered a member of the cluster.___ -- [Jake VanderPlas Python Data Science Handbook] 1
# k-means weaknesses that mixture models address directly
# code sourced from:
# http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.12-Gaussian-Mixtures.ipynb
from scipy.spatial.distance import cdist
def plot_kmeans(kmeans, X, n_clusters=k, rseed=2, ax=None):
labels = kmeans.fit_predict(X)
# plot input data
#ax = ax or plt.gca() # <-- nice trick
fig, ax = plt.subplots(figsize=(9,7))
ax.axis('equal')
ax.scatter(X[:, 0], X[:, 1],
c=labels, s=dot_size, cmap=cmap, zorder=2)
# plot the representation of Kmeans model
centers = kmeans.cluster_centers_
radii = [cdist(X[labels==i], [center]).max()
for i, center in enumerate(centers)]
for c, r in zip(centers, radii):
ax.add_patch(plt.Circle(c, r, fc='#CCCCCC',edgecolor='slategrey',
lw=4, alpha=0.5, zorder=1))
return
X3, y_true = make_blobs(n_samples = 400,
centers = k,
cluster_std = .6,
random_state = random_state)
X3 = X3[:, ::-1] # better plotting
kmeans = KMeans(n_clusters=k, random_state=random_state)
plot_kmeans(kmeans, X3)
plt.title('Clusters are hard circular boundaries', fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'Clusters are hard circular boundaries')
A resulting issue of K-Means' circular boundaries is that it has no way to account for oblong or elliptical clusters.¶
rng = np.random.RandomState(13)
X3_stretched = np.dot(X3, rng.randn(2, 2))
kmeans = KMeans(n_clusters=k, random_state=random_state)
plot_kmeans(kmeans, X3_stretched)
plt.title('Clusters cannot adjust to elliptical data structures',
fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'Clusters cannot adjust to elliptical data structures')
There are two ways we can extend K-Means¶
1. measure uncertainty in cluster assignments by comparing distances to all cluster centers
2. allow for flexibility in the shape of the cluster boundaries by using ellipses
Recall our previous height example, and let's assume that each cluster is a Gaussian distribution!¶
Gaussian distributions give flexibility to the clustering, and the same basic two step E-M algorithm used in K-Means is applied here as well.¶
Randomly initialize location and shape.
Repeat until converged:
E-step: for each point, find weights encoding the probability of membership in each cluster.
M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights.
The result of this process is that we end up with a smooth Gaussian cluster better fitted to the shape of the data, instead of a rigid inflexible circle.¶
Note that because we still are using the E-M algorithm there is no guarantee of a globally optimal result. We can visualize the results of the model.¶
# code sourced from:
# http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.12-Gaussian-Mixtures.ipynb
from matplotlib.patches import Ellipse
def draw_ellipse(position, covariance, ax=None, **kwargs):
"""Draw an ellipse with a given position and covariance"""
# Convert covariance to principal axes
if covariance.shape == (2, 2):
U, s, Vt = np.linalg.svd(covariance)
angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
width, height = 2 * np.sqrt(s)
else:
angle = 0
width, height = 2 * np.sqrt(covariance)
# Draw the Ellipse
for nsig in range(1, 4):
ax.add_patch(Ellipse(position, nsig * width, nsig * height,
angle=angle, **kwargs))
def plot_gmm(gmm, X, label=True, ax=None):
if ax is None:
fig, ax = plt.subplots(figsize=(9,7))
labels = gmm.fit(X).predict(X)
if label:
ax.scatter(X[:, 0], X[:, 1], c=labels, s=dot_size, cmap=cmap, zorder=2)
else:
ax.scatter(X[:, 0], X[:, 1], s=dot_size, zorder=2)
ax.axis('equal')
w_factor = 0.2 / gmm.weights_.max()
for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
draw_ellipse(pos, covar, ax=ax, alpha=w * w_factor)
from sklearn import mixture as mix
gmm = mix.GaussianMixture(n_components=k, random_state=random_state)
plot_gmm(gmm, X3)
# lets test on the stretched data set
gmm = mix.GaussianMixture(n_components=k, random_state=random_state+1)
plot_gmm(gmm, X3_stretched)
Notice how much better the model is able to fit the clusters when we assume each cluster is a Gaussian distribution instead of a circle whose radius is defined by the most distant point.¶
Gaussian Mixture Models as a tool for Density Estimation¶
The technical term for this type of model is:¶
generative probabilistic model
Why, you ask?¶
Because this model is really about characterizing the distribution of the entire dataset and not necessarily clustering. The power of these types of models is that they allow us to generate new samples that mimic the original underlying data!
gmm2 = mix.GaussianMixture(n_components=2, covariance_type='full',
random_state=random_state)
plot_gmm(gmm2, X_mn)
If we try to cluster this data set we run into the same issue as before.
Instead let's ignore individual clusters and model the whole distribution of data as a collection of many Gaussians.
gmm16 = mix.GaussianMixture(n_components=16, covariance_type='full',
random_state=random_state)
plot_gmm(gmm16, X_mn, label=False)
plt.title('Collective Gaussian clusters',
fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'Collective Gaussian clusters')
Looks like the collection of clusters has fit the data set reasonably well. Now let's see if the model has actually learned about this data set, such that we can create entirely new samples that look like the original.
Xnew, ynew = gmm16.sample(500)
fig, ax = plt.subplots(figsize=(9,7))
ax.scatter(Xnew[:, 0], Xnew[:, 1]);
ax.set_title('New samples drawn from fitted model',
fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'New samples drawn from fitted model')
Generative models allow for multiple methods to determine the optimal number of components. Because the model is a probability distribution, we can evaluate the likelihood of held-out data using cross-validation and/or use information criteria such as the AIC or BIC.
Sklearn makes this easy.
n_components = np.arange(1, 21)
models = [mix.GaussianMixture(n, covariance_type='full',
random_state=random_state).fit(X_mn)
for n in n_components]
fig, ax = plt.subplots(figsize=(9,7))
ax.plot(n_components, [m.bic(X_mn) for m in models], label='BIC')
ax.plot(n_components, [m.aic(X_mn) for m in models], label='AIC')
ax.axvline(n_components[np.argmin([m.bic(X_mn) for m in models])], color='blue')
ax.axvline(n_components[np.argmin([m.aic(X_mn) for m in models])], color='green')
plt.legend(loc='best')
plt.xlabel('n_components')
Text(0.5, 0, 'n_components')
Lab11: A4 Image Compression using kMeans¶
Clustering can be used to reduce the number of colors in an image: similar colors are assigned to the same cluster label, i.e. the same entry of a color palette. In the following exercise, you will load an image as a $\left[ w, h, 3\right]$ numpy.array of type float64, where $w$ and $h$ are the width and height in pixels, respectively. The last dimension of the three-dimensional array holds the three RGB color channels. Using kMeans, we will reduce the color depth from 24 bits to 64 colors (6 bits) and to 16 colors (4 bits).
(a) Reading an Image¶
Start by reading in an image from the Python imaging library PIL (https://en.wikipedia.org/wiki/Python_Imaging_Library) in your Jupyter notebook.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.utils import shuffle
from PIL import Image
# First we read and flatten the image.
original_img = np.array(Image.open('./images/sunflower.jpg'), dtype=np.float64) / 255
print(original_img.shape)
original_dimensions = tuple(original_img.shape)
width, height, depth = tuple(original_img.shape)
(1440, 2560, 3)
(b) Flatten the image¶
Flatten the image to a $\left[w \cdot h, 3 \right]$-dimensional numpy.array and shuffle the pixels using sklearn.utils.shuffle.
image_flattened = np.reshape(original_img, (width * height, depth))
(c) Quantization to 64 colors¶
Create an instance of the kMeans class called estimator. Use the fit method of kMeans to create sixty-four clusters (n_clusters=64) from a sample of one thousand randomly selected colors, e.g. the first 1000 colors of the shuffled pixels. The new color palette is given by the cluster centers that are accessible in estimator.cluster_centers_.
Each of the clusters will be a color in the compressed palette.
image_array_sample = shuffle(image_flattened, random_state=0)[:1000]
estimator = KMeans(n_clusters=64, n_init=10, random_state=0)
estimator.fit(image_array_sample)
KMeans(n_clusters=64, n_init=10, random_state=0)
(d) Prediction of the cluster assignment (labels)¶
Assign the cluster labels to each pixel in the original image using the .predict method
of your kMeans instance. Now you know to which color in your reduced palette each pixel
belongs, i.e. we predict the cluster assignment for each of the pixels in the original image.
cluster_assignments = estimator.predict(image_flattened)
(e) Create a compressed image using only 64 colors¶
Finally, we create the compressed image from the compressed palette and cluster assignments.
Loop over all pixels, assign to each pixel the palette color corresponding to its cluster label, and create a new, color-reduced picture. Plot the images using plt.imshow and compare the original image with the 64-color image. Try the same with 32 and 16 colors.
compressed_palette = estimator.cluster_centers_
compressed_img = np.zeros((width, height, compressed_palette.shape[1]))
label_idx = 0
for i in range(width):
for j in range(height):
compressed_img[i][j] = compressed_palette[cluster_assignments[label_idx]]
label_idx += 1
plt.figure(figsize=(12,12))
#plt.subplot(121)
plt.title('Original Image', fontsize=24)
plt.imshow(original_img, origin='lower')
plt.axis('off')
#plt.subplot(122)
plt.figure(figsize=(12,12))
plt.title('Compressed Image', fontsize=24)
plt.imshow(compressed_img,origin='lower')
plt.axis('off')
plt.show()
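The double loop above works but is slow in pure Python; since `cluster_assignments` is simply an integer index into the palette, the same image can be built with one NumPy fancy-indexing operation. A sketch with a small hypothetical palette and label array standing in for the real `compressed_palette`, `cluster_assignments`, `width`, and `height` from the cells above:

```python
import numpy as np

# hypothetical stand-ins for the objects created above
width, height = 2, 3
compressed_palette = np.array([[1.0, 0.0, 0.0],   # color 0: red
                               [0.0, 1.0, 0.0]])  # color 1: green
cluster_assignments = np.array([0, 1, 1, 0, 0, 1])

# fancy indexing maps every label to its palette color; reshape restores the image
compressed_img = compressed_palette[cluster_assignments].reshape(width, height, 3)
print(compressed_img.shape)  # (2, 3, 3)
```

With the real arrays, this one-liner replaces the entire `label_idx` loop and scales to megapixel images.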
ML11 A5 Detecting similar Faces using DBSCAN¶
The labelled faces dataset of scikit-learn contains gray-scale images of 62 different famous
personalities, mostly from politics. In this exercise, we assume that there are no target labels, i.e. the names of
the persons are unknown. We want to find a method to cluster similar images. This can be done
using a dimensionality reduction algorithm like PCA for feature generation and a subsequent
clustering, e.g. using DBSCAN.
%matplotlib inline
from IPython.display import set_matplotlib_formats, display
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from cycler import cycler
plt.rcParams['image.cmap'] = "gray"
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
(a) Loading the Faces Dataset¶
Open the Jupyter notebook DBSCAN_DetectSimilarFaces.ipynb and have a look at the
first few faces of the dataset. Not every person is represented equally frequently in this
unbalanced dataset. For classification, we would have to take this into account. We extract
the first 50 images of each person and put them into a flat array called X_people. The
corresponding targets (y-values, names) are stored in the y_people array.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
people = fetch_lfw_people(min_faces_per_person=20, resize=2)
image_shape = people.images[0].shape
fig, axes = plt.subplots(2, 5, figsize=(15, 8),
subplot_kw={'xticks': (), 'yticks': ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
ax.imshow(image)
ax.set_title(people.target_names[target])
np.shape(people.images)
(3023, 250, 188)
left = 5
top = 5
right = image_shape[1]-left
bottom = image_shape[0]-top
import PIL
for img in people.images[0:3,:,:]:
#pil_img = PIL.Image.fromarray(np.uint8(img*255))
pil_img = PIL.Image.fromarray(img)
plt.figure()
plt.imshow(np.array(pil_img))
pil_img = pil_img.crop((left, top, right, bottom))
plt.figure()
plt.imshow(np.array(pil_img))
np.shape(np.array(pil_img))
(240, 178)
print("people.images.shape: {}".format(people.images.shape))
print("Number of classes: {}".format(len(people.target_names)))
people.images.shape: (3023, 250, 188) Number of classes: 62
people.target_names
array(['Alejandro Toledo', 'Alvaro Uribe', 'Amelie Mauresmo',
'Andre Agassi', 'Angelina Jolie', 'Ariel Sharon',
'Arnold Schwarzenegger', 'Atal Bihari Vajpayee', 'Bill Clinton',
'Carlos Menem', 'Colin Powell', 'David Beckham', 'Donald Rumsfeld',
'George Robertson', 'George W Bush', 'Gerhard Schroeder',
'Gloria Macapagal Arroyo', 'Gray Davis', 'Guillermo Coria',
'Hamid Karzai', 'Hans Blix', 'Hugo Chavez', 'Igor Ivanov',
'Jack Straw', 'Jacques Chirac', 'Jean Chretien',
'Jennifer Aniston', 'Jennifer Capriati', 'Jennifer Lopez',
'Jeremy Greenstock', 'Jiang Zemin', 'John Ashcroft',
'John Negroponte', 'Jose Maria Aznar', 'Juan Carlos Ferrero',
'Junichiro Koizumi', 'Kofi Annan', 'Laura Bush',
'Lindsay Davenport', 'Lleyton Hewitt', 'Luiz Inacio Lula da Silva',
'Mahmoud Abbas', 'Megawati Sukarnoputri', 'Michael Bloomberg',
'Naomi Watts', 'Nestor Kirchner', 'Paul Bremer', 'Pete Sampras',
'Recep Tayyip Erdogan', 'Ricardo Lagos', 'Roh Moo-hyun',
'Rudolph Giuliani', 'Saddam Hussein', 'Serena Williams',
'Silvio Berlusconi', 'Tiger Woods', 'Tom Daschle', 'Tom Ridge',
'Tony Blair', 'Vicente Fox', 'Vladimir Putin', 'Winona Ryder'],
dtype='<U25')
# count how often each target appears
counts = np.bincount(people.target)
# print counts next to target names:
for i, (count, name) in enumerate(zip(counts, people.target_names)):
print("{0:25} {1:3}".format(name, count), end=' ')
if (i + 1) % 3 == 0:
print()
Alejandro Toledo 39 Alvaro Uribe 35 Amelie Mauresmo 21 Andre Agassi 36 Angelina Jolie 20 Ariel Sharon 77 Arnold Schwarzenegger 42 Atal Bihari Vajpayee 24 Bill Clinton 29 Carlos Menem 21 Colin Powell 236 David Beckham 31 Donald Rumsfeld 121 George Robertson 22 George W Bush 530 Gerhard Schroeder 109 Gloria Macapagal Arroyo 44 Gray Davis 26 Guillermo Coria 30 Hamid Karzai 22 Hans Blix 39 Hugo Chavez 71 Igor Ivanov 20 Jack Straw 28 Jacques Chirac 52 Jean Chretien 55 Jennifer Aniston 21 Jennifer Capriati 42 Jennifer Lopez 21 Jeremy Greenstock 24 Jiang Zemin 20 John Ashcroft 53 John Negroponte 31 Jose Maria Aznar 23 Juan Carlos Ferrero 28 Junichiro Koizumi 60 Kofi Annan 32 Laura Bush 41 Lindsay Davenport 22 Lleyton Hewitt 41 Luiz Inacio Lula da Silva 48 Mahmoud Abbas 29 Megawati Sukarnoputri 33 Michael Bloomberg 20 Naomi Watts 22 Nestor Kirchner 37 Paul Bremer 20 Pete Sampras 22 Recep Tayyip Erdogan 30 Ricardo Lagos 27 Roh Moo-hyun 32 Rudolph Giuliani 26 Saddam Hussein 23 Serena Williams 52 Silvio Berlusconi 33 Tiger Woods 23 Tom Daschle 25 Tom Ridge 33 Tony Blair 144 Vicente Fox 32 Vladimir Putin 49 Winona Ryder 24
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
mask[np.where(people.target == target)[0][:50]] = 1
X_people = people.data[mask]
y_people = people.target[mask]
# scale the grey-scale values to be between 0 and 1
# instead of 0 and 255 for better numeric stability:
X_people = X_people / 255.
NumberOfPeople=np.unique(people.target).shape[0]
TargetNames = [];
n=5
#find the first 5 images from each person
fig, axes = plt.subplots(12, 5, figsize=(15, 30),
subplot_kw={'xticks': (), 'yticks': ()})
for target,ax in zip(np.unique(people.target),axes.ravel()):
#get the first n pictures from each person
indices=np.where(people.target == target)[0][:n]
TargetNames.append(people.target_names[target])
image=people.images[indices[0]]
ax.imshow(image)
ax.set_title(str(target)+': '+TargetNames[target])
(b) Principal Component Analysis¶
Now apply a principal component analysis X_pca=pca.fit_transform(X_people) and
extract the first 100 components of each image. Reconstruct the first 10 entries of the dataset
from the 100 components of the PCA-transformed data by applying the
pca.inverse_transform method and reshaping the image to the original size using
np.reshape.
What is the minimum number of components necessary such that you recognize the persons? Try it out.
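One quantitative guide (an addition to the exercise text) is the cumulative explained variance ratio of the fitted PCA: the smallest number of components that reaches, say, 90% of the variance is a reasonable starting point. The sketch below uses synthetic low-rank data as a stand-in for X_people, so it runs without downloading the faces:

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in for X_people: 200 samples, 500 features, rank ~30 plus tiny noise
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 30) @ rng.randn(30, 500) + 0.01 * rng.randn(200, 500)

pca = PCA(n_components=100).fit(X_demo)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components explaining 90% of the variance
n90 = int(np.searchsorted(cumvar, 0.90)) + 1
print(n90)
```

On the real face data, plot `cumvar` and compare the elbow against what your eyes need to recognize the persons in the reconstructions.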
NumberOfPeople
62
#extract eigenfaces from lfw data and transform data
from sklearn.decomposition import PCA
pca = PCA(n_components=100, whiten=True, random_state=0)
X_pca = pca.fit_transform(X_people)
#X_pca = pca.transform(X_people)
image_shape = people.images[0].shape
NumberOfSamples=X_pca.shape[0]
fig, axes = plt.subplots(2, 5, figsize=(15, 8),
subplot_kw={'xticks': (), 'yticks': ()})
for ix, target, ax in zip(np.arange(NumberOfSamples), y_people, axes.ravel()):
image=np.reshape(pca.inverse_transform(X_pca[ix,:]),image_shape)
ax.imshow(image)
ax.set_title(str(y_people[ix])+': '+people.target_names[target])
(c) Apply DBSCAN on these features¶
Import the DBSCAN class from sklearn.cluster, create an instance called dbscan, apply it to the PCA-transformed data X_pca, and extract the cluster labels using labels = dbscan.fit_predict(X_pca). First use the standard parameters of the method and check how many unique clusters the algorithm finds by analyzing the number of unique entries in the predicted cluster labels.
# apply DBSCAN with default parameters
from sklearn.cluster import DBSCAN
dbscan = DBSCAN()
labels = dbscan.fit_predict(X_pca)
print("Unique labels: {}".format(np.unique(labels)))
Unique labels: [-1]
(d) Variation of the eps parameter¶
Change the eps parameter of the DBSCAN instance using DBSCAN(min_samples=3, eps=5). Vary the value of eps in the range from 5 to 10 in steps of 0.5 using a for loop and check for each value of eps how many clusters can be found.
# scan eps on a finer grid (step 0.04) over the interesting range 6 to 8
for eps in np.linspace(6, 8, 51):
    print("\neps={}".format(eps))
    dbscan = DBSCAN(eps=eps, min_samples=3)
    labels = dbscan.fit_predict(X_pca)
    print("Number of clusters: {}".format(len(np.unique(labels))))
    print("Cluster sizes: {}".format(np.bincount(labels + 1)))
eps=6.0
Number of clusters: 3
Cluster sizes: [2052 7 4]
eps=6.04
Number of clusters: 3
Cluster sizes: [2052 7 4]
eps=6.08
Number of clusters: 3
Cluster sizes: [2052 7 4]
eps=6.12
Number of clusters: 3
Cluster sizes: [2052 7 4]
eps=6.16
Number of clusters: 3
Cluster sizes: [2051 7 5]
eps=6.2
Number of clusters: 3
Cluster sizes: [2051 7 5]
eps=6.24
Number of clusters: 3
Cluster sizes: [2051 7 5]
eps=6.28
Number of clusters: 3
Cluster sizes: [2051 7 5]
eps=6.32
Number of clusters: 3
Cluster sizes: [2051 7 5]
eps=6.36
Number of clusters: 3
Cluster sizes: [2051 7 5]
eps=6.4
Number of clusters: 3
Cluster sizes: [2051 7 5]
eps=6.44
Number of clusters: 2
Cluster sizes: [2051 12]
eps=6.48
Number of clusters: 3
Cluster sizes: [2048 3 12]
eps=6.52
Number of clusters: 3
Cluster sizes: [2048 3 12]
eps=6.5600000000000005
Number of clusters: 3
Cluster sizes: [2048 3 12]
eps=6.6
Number of clusters: 3
Cluster sizes: [2048 3 12]
eps=6.64
Number of clusters: 3
Cluster sizes: [2048 3 12]
eps=6.68
Number of clusters: 3
Cluster sizes: [2048 3 12]
eps=6.72
Number of clusters: 4
Cluster sizes: [2043 3 13 4]
eps=6.76
Number of clusters: 5
Cluster sizes: [2040 3 13 3 4]
eps=6.8
Number of clusters: 8
Cluster sizes: [2030 3 13 3 4 3 4 3]
eps=6.84
Number of clusters: 11
Cluster sizes: [2020 3 13 3 4 3 3 4 3 4 3]
eps=6.88
Number of clusters: 12
Cluster sizes: [2016 3 13 3 4 3 4 4 3 4 3 3]
eps=6.92
Number of clusters: 14
Cluster sizes: [2007 3 14 6 3 3 3 4 4 3 3 4 3 3]
eps=6.96
Number of clusters: 14
Cluster sizes: [2005 3 14 7 4 3 3 4 4 3 3 4 3 3]
eps=7.0
Number of clusters: 14
Cluster sizes: [2004 3 14 7 4 3 3 4 4 3 3 5 3 3]
eps=7.04
Number of clusters: 12
Cluster sizes: [2003 4 14 13 4 3 3 4 4 5 3 3]
eps=7.08
Number of clusters: 12
Cluster sizes: [2002 4 3 14 13 4 3 5 4 5 3 3]
eps=7.12
Number of clusters: 14
Cluster sizes: [1996 4 3 14 13 3 4 3 5 3 4 3 5 3]
eps=7.16
Number of clusters: 12
Cluster sizes: [1991 4 3 14 23 8 3 5 3 3 3 3]
eps=7.2
Number of clusters: 13
Cluster sizes: [1979 13 3 4 14 26 3 5 3 4 3 3 3]
eps=7.24
Number of clusters: 12
Cluster sizes: [1973 16 3 5 42 3 5 3 4 3 3 3]
eps=7.28
Number of clusters: 12
Cluster sizes: [1965 63 3 5 3 5 3 4 3 3 3 3]
eps=7.32
Number of clusters: 12
Cluster sizes: [1959 67 3 5 3 6 3 4 3 4 3 3]
eps=7.36
Number of clusters: 13
Cluster sizes: [1950 71 4 5 4 3 6 3 4 3 4 3 3]
eps=7.4
Number of clusters: 13
Cluster sizes: [1941 77 4 7 4 6 3 4 4 3 4 3 3]
eps=7.4399999999999995
Number of clusters: 13
Cluster sizes: [1937 80 7 4 7 4 3 4 4 3 4 3 3]
eps=7.48
Number of clusters: 14
Cluster sizes: [1926 90 7 5 4 3 3 4 4 3 5 3 3 3]
eps=7.52
Number of clusters: 14
Cluster sizes: [1917 93 12 5 3 5 3 3 4 5 3 4 3 3]
eps=7.5600000000000005
Number of clusters: 14
Cluster sizes: [1911 97 12 6 3 5 3 3 4 5 4 4 3 3]
eps=7.6
Number of clusters: 14
Cluster sizes: [1901 109 12 3 5 3 5 3 3 4 5 4 3 3]
eps=7.640000000000001
Number of clusters: 16
Cluster sizes: [1872 142 4 4 3 4 3 3 5 3 3 3 3 5 3 3]
eps=7.68
Number of clusters: 15
Cluster sizes: [1866 148 4 4 3 8 4 3 3 3 3 3 5 3 3]
eps=7.72
Number of clusters: 12
Cluster sizes: [1852 169 4 4 10 4 3 3 3 3 3 5]
eps=7.76
Number of clusters: 8
Cluster sizes: [1841 200 4 4 3 3 3 5]
eps=7.8
Number of clusters: 8
Cluster sizes: [1824 217 4 4 3 3 5 3]
eps=7.84
Number of clusters: 8
Cluster sizes: [1815 226 4 4 3 3 5 3]
eps=7.88
Number of clusters: 8
Cluster sizes: [1805 236 4 4 3 3 5 3]
eps=7.92
Number of clusters: 6
Cluster sizes: [1797 250 4 4 5 3]
eps=7.96
Number of clusters: 5
Cluster sizes: [1788 263 4 5 3]
eps=8.0
Number of clusters: 6
Cluster sizes: [1762 287 3 3 5 3]
(e) Maximum number of clusters found¶
Select the value of eps for which the number of clusters found is largest and plot the members of the clusters using the following Python code.
dbscan = DBSCAN(min_samples=3, eps=7.64)
labels = dbscan.fit_predict(X_pca)
for cluster in range(max(labels) + 1):
    mask = labels == cluster
    n_images = np.sum(mask)
    if n_images < 7:
        fig, axes = plt.subplots(1, n_images, figsize=(n_images * 1.5, 4),
                                 subplot_kw={'xticks': (), 'yticks': ()})
        for image, label, ax in zip(X_people[mask], y_people[mask], axes):
            ax.imshow(image.reshape(image_shape), vmin=0, vmax=1)
            ax.set_title(people.target_names[label].split()[-1])
Bonus: Agglomerative and Spectral Clustering (optional)¶
# %% using other cluster algorithms learner on the pca transformed data
from time import time
from sklearn import cluster
from sklearn.neighbors import kneighbors_graph
n_clusters=14
clustering_names = ['SpectralClustering', 'Ward', 'AverageLinkage']
connectivity = kneighbors_graph(X_pca, n_neighbors=n_clusters, include_self=False)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
spectral = cluster.SpectralClustering(n_clusters=n_clusters,
eigen_solver='arpack',
affinity="nearest_neighbors")
ward = cluster.AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
connectivity=connectivity)
average_linkage = cluster.AgglomerativeClustering(
linkage="average", affinity="cityblock", n_clusters=n_clusters,
connectivity=connectivity)
clustering_algorithms = [spectral, ward, average_linkage]
# %matplotlib inline
for name, algorithm in zip(clustering_names, clustering_algorithms):
    # fit the algorithm and predict cluster memberships
    print(algorithm)
    t0 = time()
    algorithm.fit(X_pca)
    t1 = time()
    if hasattr(algorithm, 'labels_'):
        labels = algorithm.labels_.astype(int)  # note: np.int is deprecated, use int
    else:
        labels = algorithm.predict(X_pca)
    print("%s: %.2g sec" % (name, t1 - t0))
    print('labels found: %i' % (max(labels) + 1))
    print("_____________________________________________")
    print("            %s            " % (name))
    print("_____________________________________________")
    # do not call the loop variable 'cluster': it would shadow sklearn's cluster module
    for c in range(max(labels) + 1):
        mask = labels == c
        ind = np.where(mask)[0]
        n_images = np.size(ind)
        max_image = np.min([n_images, 8])
        print('max image: %i\n' % (max_image))
        fig, axes = plt.subplots(1, max_image, figsize=(max_image * 3, 3),
                                 subplot_kw={'xticks': (), 'yticks': ()})
        if max_image == 1:
            print(ind[0])
            image = X_people[ind[0]]
            label = y_people[ind[0]]
            plt.imshow(image.reshape(image_shape), vmin=0, vmax=1)
            plt.title(people.target_names[label].split()[-1])
        else:
            for image, label, ax in zip(X_people[mask], y_people[mask], axes):
                ax.imshow(image.reshape(image_shape), vmin=0, vmax=1)
                ax.set_title(people.target_names[label].split()[-1])
        plt.show()
SpectralClustering(affinity='nearest_neighbors', eigen_solver='arpack',
n_clusters=14)
SpectralClustering: 1.7 sec
labels found: 14
_____________________________________________
SpectralClustering
_____________________________________________
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
AgglomerativeClustering(connectivity=<2063x2063 sparse matrix of type '<class 'numpy.float64'>'
with 53372 stored elements in Compressed Sparse Row format>,
n_clusters=14)
Ward: 0.83 sec
labels found: 14
_____________________________________________
Ward
_____________________________________________
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
max image: 8
AgglomerativeClustering(affinity='cityblock',
connectivity=<2063x2063 sparse matrix of type '<class 'numpy.float64'>'
with 53372 stored elements in Compressed Sparse Row format>,
linkage='average', n_clusters=14)
AverageLinkage: 2.8 sec
labels found: 14
_____________________________________________
AverageLinkage
_____________________________________________
max image: 8
max image: 2
max image: 2
max image: 1 1606
max image: 1 1989
max image: 1 1982
max image: 1 1219
max image: 1 1090
max image: 1 1543
max image: 1 661
max image: 1 1507
max image: 1 627
max image: 1 1881
max image: 1 595
k-Means, Gaussian Mixture Models and the EM algorithm¶
import warnings
warnings.filterwarnings("ignore")
from IPython.core.display import display, HTML
import time
import pandas as pd
#import pandas_datareader.data as web
import numpy as np
import scipy.stats as scs
from scipy.stats import multivariate_normal as mvn
import sklearn.mixture as mix
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
To gain an understanding of mixture models, we have to start at the beginning with the expectation-maximization algorithm and its application¶
First, a little history of the EM algorithm¶
Reference: 4
Dempster, Laird & Rubin (1977) unified previously unrelated work under "The EM Algorithm". Several earlier, overlooked E-M works preceded it (note the gaps between the foundational authors):
- Newcomb (1887)
- McKendrick (1926) [+39 years]
- Hartley (1958) [+32 years]
- Baum et al. (1970) [+12 years]
- Dempster et al. (1977) [+7 years]
EM Algorithm developed over 90 years¶
EM provides a general framework for solving problems¶
Examples include:
- Filling in missing data in a sample set
- Discovering the values of latent variables
- Estimating the parameters of HMMs
- Estimating the parameters of finite mixture models
- Unsupervised learning of clusters
- etc.
(a) How does the EM algorithm work?¶
EM is an iterative process that begins with a "naive" or random initialization and then alternates between the expectation and maximization steps until the algorithm reaches convergence.
To describe this in words imagine we have a simple data set consisting of class heights with groups separated by gender.
# import class heights
f = 'https://raw.githubusercontent.com/BlackArbsCEO/Mixture_Models/K-Means%2C-E-M%2C-Mixture-Models/Class_heights.csv'
data = pd.read_csv(f)
data.to_csv('Class_heights.csv')
# data.info()
height = data['Height (in)']
data.head()
| Gender | Height (in) | |
|---|---|---|
| 0 | Male | 72 |
| 1 | Male | 72 |
| 2 | Female | 63 |
| 3 | Female | 62 |
| 4 | Female | 62 |
data.describe()
| Height (in) | |
|---|---|
| count | 21.000000 |
| mean | 66.190476 |
| std | 4.130606 |
| min | 60.000000 |
| 25% | 63.000000 |
| 50% | 65.000000 |
| 75% | 71.000000 |
| max | 73.000000 |
Now imagine that we did not have the convenient gender labels associated with each data point. How could we estimate the two group means?
First let's set up our problem.
In this example we hypothesize that these height data points are drawn from two distributions with two means, $\mu_1$ and $\mu_2$.
The heights are the observed $x$ values.
The hidden variables, which EM is going to estimate, can be thought of in the following way: each $x$ value has two associated $z$ values, $z_1$ and $z_2$, which represent the distribution (or class, or cluster) that the data point is drawn from.
Understanding the range of values the $z$ values can take is important.
In k-means, the two $z$'s can only take the values of 0 or 1. If the $x$ value came from the first distribution (cluster), then $z_1$=1 and $z_2$=0 and vice versa. This is called hard clustering.
In Gaussian Mixture Models, the $z$'s can take on any value between 0 and 1 because the $x$ values are considered to be drawn probabilistically from one of the two distributions. For example, the $z$ values can be $z_1$=0.85 and $z_2$=0.15, which represents a strong probability that the $x$ value came from distribution 1 and a smaller probability that it came from distribution 2. This is called soft or fuzzy clustering.
For this example, we will assume the x values are drawn from Gaussian distributions.
To start the algorithm, we choose two random means.
From there we repeat the following until convergence.
The expectation step:¶
We calculate the expected values $E(z_{ij})$, i.e. the probability that $x_i$ was drawn from the $j$-th distribution:

$$E(z_{ij}) = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

The formula states that the expected value of $z_{ij}$ is the probability of $x_i$ given $\mu_j$, divided by the sum of the probabilities that $x_i$ belongs to each $\mu_n$.
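Plugging in concrete numbers makes the E-step tangible. A minimal numeric sketch for a single height $x_i = 64$ with current guesses $\mu_1 = 70$, $\mu_2 = 63$ and an assumed shared $\sigma = 2$ (all values hypothetical):

```python
import numpy as np

x_i = 64.0
mus = np.array([70.0, 63.0])   # current guesses for mu_1, mu_2
sigma = 2.0                    # assumed shared standard deviation

# Unnormalized Gaussian weights; the common 1/sqrt(2*pi*sigma^2) factor cancels
w = np.exp(-(x_i - mus) ** 2 / (2 * sigma ** 2))
E_z = w / w.sum()              # responsibilities E(z_i1), E(z_i2)
print(E_z)
```

Since 64 is much closer to 63 than to 70, $E(z_{i2})$ dominates, and the two responsibilities always sum to one.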
The maximization step:¶
After calculating all $E(z_{ij})$ values we can calculate (update) new $\mu$ values.
$$\mu_j = \frac{\sum_{i=1}^{m} E(z_{ij})\, x_i}{\sum_{i=1}^{m} E(z_{ij})}$$

This formula generates the maximum likelihood estimate of each mean.
By repeating the E-step and M-step we are guaranteed to find a local maximum giving us a maximum likelihood estimation of our hypothesis.
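The two steps condense into a few lines for the one-dimensional, two-mean case with a fixed shared $\sigma$ — a sketch with synthetic "heights" (the class data would work the same way):

```python
import numpy as np

rng = np.random.RandomState(1)
# synthetic "heights": two groups around 63 and 70 inches
x = np.concatenate([rng.normal(63, 2, 12), rng.normal(70, 2, 9)])

mus = np.array([60.0, 75.0])   # starting guesses for the two means
sigma = 2.0                    # fixed, shared standard deviation

for _ in range(20):
    # E-step: responsibility of each mean for each point
    w = np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * sigma ** 2))
    E_z = w / w.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means
    mus = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)
print(mus)
```

Starting from deliberately bad guesses (60 and 75), the means converge toward the two group means within a handful of iterations.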
What are Maximum Likelihood Estimates (MLEs)?¶
1. Parameters describe characteristics (attributes) of a population. These parameter values are estimated from samples collected from that population.
2. An MLE is the parameter estimate that is most consistent with the sampled data. By definition it maximizes the likelihood function. One way to find it is to take the first derivative of the likelihood function with respect to the parameter $\theta$, set it to zero, and solve; the solution maximizes the likelihood function and is the MLE.
A quick example of a maximum likelihood estimate¶
You flip a coin 10 times and observe the following sequence (H, T, T, H, T, T, T, T, H, T)¶
What is the MLE of the probability of heads, given 3 heads in 10 trials?¶
simple answer:¶
The frequentist MLE is (# of successes) / (# of trials) or 3/10
solving first derivative of binomial distribution answer:¶
\begin{align} \mathcal L(\theta) & = {10 \choose 3}\theta^3(1-\theta)^7 \\[1ex] \log\mathcal L(\theta) & = \log{10 \choose 3} + 3\log\theta + 7\log(1 - \theta) \\[1ex] \frac{d\log\mathcal L(\theta)}{d\theta} & = \frac{3}{\theta} - \frac{7}{1-\theta} = 0 \\[1ex] \frac{3}{\theta} & = \frac{7}{1 - \theta} \Rightarrow \theta = \frac{3}{10} \end{align}That's an MLE! This is the estimate that is most consistent with the observed data¶
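The same answer can be checked numerically by evaluating the binomial likelihood on a grid of $\theta$ values — a quick sketch using scipy.stats.binom:

```python
import numpy as np
from scipy.stats import binom

# Likelihood of observing 3 heads in 10 flips, as a function of theta
thetas = np.linspace(0.01, 0.99, 99)   # grid with step 0.01
likelihood = binom.pmf(3, 10, thetas)
theta_hat = thetas[np.argmax(likelihood)]
print(theta_hat)
```

The grid maximum sits at 3/10, matching the derivative-based derivation above.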
Back to our height example. Using the generalized Gaussian mixture model code sourced from Duke's computational statistics we can visualize this process.
# Code sourced from:
# http://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html
def em_gmm_orig(xs, pis, mus, sigmas, tol=0.01, max_iter=20):
    n, p = xs.shape
    k = len(pis)
    ll_old = 0
    for itercount in range(max_iter):
        print('\nIteration: ', itercount)
        print()

        # E-step: responsibilities ws[j, i] = P(z_i = j | x_i)
        ws = np.zeros((k, n))
        for j in range(k):
            for i in range(n):
                ws[j, i] = pis[j] * mvn(mus[j], sigmas[j]).pdf(xs[i])
        ws /= ws.sum(0)

        # M-step: update mixing weights, means and covariances
        pis = np.zeros(k)
        for j in range(k):
            for i in range(n):
                pis[j] += ws[j, i]
        pis /= n

        mus = np.zeros((k, p))
        for j in range(k):
            for i in range(n):
                mus[j] += ws[j, i] * xs[i]
            mus[j] /= ws[j, :].sum()

        sigmas = np.zeros((k, p, p))
        for j in range(k):
            for i in range(n):
                ys = np.reshape(xs[i] - mus[j], (p, 1))
                sigmas[j] += ws[j, i] * np.dot(ys, ys.T)
            sigmas[j] /= ws[j, :].sum()

        # visualize the current estimates of the two components
        new_mus = (np.diag(mus)[0], np.diag(mus)[1])
        new_sigs = (np.unique(np.diag(sigmas[0]))[0], np.unique(np.diag(sigmas[1]))[0])
        df = (pd.DataFrame(index=[1, 2]).assign(mus=new_mus).assign(sigs=new_sigs))
        xx = np.linspace(0, 100, 100)
        yy = scs.multivariate_normal.pdf(xx, mean=new_mus[0], cov=new_sigs[0])
        colors = sns.color_palette('Dark2', 3)
        fig, ax = plt.subplots(figsize=(9, 7))
        ax.set_ylim(-0.001, np.max(yy))
        ax.plot(xx, yy, color=colors[1])
        ax.axvline(new_mus[0], ymin=0., color=colors[1])
        ax.fill_between(xx, 0, yy, alpha=0.5, color=colors[1])
        lo, hi = ax.get_ylim()
        ax.annotate(f'$\mu_1$: {new_mus[0]:3.2f}',
                    fontsize=12, fontweight='demi',
                    xy=(new_mus[0], (hi - lo) / 2),
                    xycoords='data', xytext=(80, (hi - lo) / 2),
                    arrowprops=dict(facecolor='black', connectionstyle="arc3,rad=0.2", shrink=0.05))
        yy2 = scs.multivariate_normal.pdf(xx, mean=new_mus[1], cov=new_sigs[1])
        ax.plot(xx, yy2, color=colors[2])
        ax.axvline(new_mus[1], ymin=0., color=colors[2])
        lo, hi = ax.get_ylim()
        ax.annotate(f'$\mu_2$: {new_mus[1]:3.2f}',
                    fontsize=12, fontweight='demi',
                    xy=(new_mus[1], (hi - lo) / 2), xycoords='data', xytext=(25, (hi - lo) / 2),
                    arrowprops=dict(facecolor='black', connectionstyle="arc3,rad=0.2", shrink=0.05))
        ax.fill_between(xx, 0, yy2, alpha=0.5, color=colors[2])
        # white dots mark the observed heights
        dot_kwds = dict(markerfacecolor='white', markeredgecolor='black',
                        markeredgewidth=1, markersize=10)
        ax.plot(height, len(height) * [0], 'o', **dot_kwds)
        ax.set_ylim(-0.001, np.max(yy2))
        figureFileName = "EM_GMM_iter_%i.png" % itercount
        print(figureFileName)
        plt.savefig(figureFileName, dpi=600)
        print(df.T)

        # update the complete log likelihood
        ll_new = 0.0
        for i in range(n):
            s = 0
            for j in range(k):
                s += pis[j] * mvn(mus[j], sigmas[j]).pdf(xs[i])
            ll_new += np.log(s)
        print(f'log_likelihood: {ll_new:3.4f}')
        if np.abs(ll_new - ll_old) < tol:
            break
        ll_old = ll_new
    return ll_new, pis, mus, sigmas
height = data['Height (in)']
n = len(height)
# Ground truthish
_mus = np.array([[0, data.groupby('Gender').mean().iat[0, 0]],
[data.groupby('Gender').mean().iat[1, 0], 0]])
_sigmas = np.array([[[5, 0], [0, 5]],
[[5, 0],[0, 5]]])
_pis = np.array([0.5, 0.5]) # priors
# initial random guesses for parameters
np.random.seed(0)
pis = np.random.random(2)
pis /= pis.sum()
mus = np.random.random((2,2))
sigmas = np.array([np.eye(2)] * 2) * height.std()
# generate our noisy x values
xs = np.concatenate([np.random.multivariate_normal(mu, sigma, int(pi*n))
for pi, mu, sigma in zip(_pis, _mus, _sigmas)])
ll, pis, mus, sigmas = em_gmm_orig(xs, pis, mus, sigmas)
# In the below plots the white dots represent the observed heights.
Iteration: 0
EM_GMM_iter_0.png
1 2
mus 61.362928 59.659685
sigs 469.240750 244.382352
log_likelihood: -141.8092
Iteration: 1
EM_GMM_iter_1.png
1 2
mus 68.73773 63.620554
sigs 109.85442 7.228183
log_likelihood: -118.0520
Iteration: 2
EM_GMM_iter_2.png
1 2
mus 70.569842 63.688825
sigs 4.424452 3.139277
log_likelihood: -100.2591
Iteration: 3
EM_GMM_iter_3.png
1 2
mus 70.569842 63.688825
sigs 4.424427 3.139278
log_likelihood: -100.2591
Notice how the algorithm was able to estimate the true means starting from random guesses for the parameters.¶
Now that we have a grasp of the algorithm we can examine K-Means as a form of EM¶
K-Means is an unsupervised learning algorithm used for clustering multidimensional data sets.
The basic form of K-Means makes two assumptions
1. Each data point is closer to its own cluster center than to the other cluster centers.
2. A cluster center is the arithmetic mean of all the points that belong to the cluster.
The expectation step is done by calculating the pairwise distances from every data point to every center and assigning cluster membership to the closest center (mean).
The maximization step is simply the arithmetic mean of the previously assigned data points for each cluster.
The following sections borrow heavily from Jake Vanderplas' Python Data Science Handbook¶
# Let's define some demo variables and make some blobs
# demo variables
k = 4
n_draws = 500
sigma = .7
random_state = 0
dot_size = 50
cmap = 'viridis'
# make blobs
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples = n_draws,
centers = k,
cluster_std = sigma,
random_state = random_state)
fig, ax = plt.subplots(figsize=(9,7))
ax.scatter(X[:, 0], X[:, 1], s=dot_size)
plt.title('k-means make blobs', fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'k-means make blobs')
# sample implementation
# code sourced from:
# http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.11-K-Means.ipynb
from sklearn.metrics import pairwise_distances_argmin
def find_clusters(X, n_clusters, rseed=2):
    # 1. Random initialization (choose random points as cluster centers)
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    while True:
        # 2a. E-step: assign labels based on closest center
        labels = pairwise_distances_argmin(X, centers)
        # 2b. M-step: find new centers as the means of the assigned points
        new_centers = np.array([X[labels == i].mean(0)
                                for i in range(n_clusters)])
        # 2c. Check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
    return centers, labels
# now let's compare this to sklearn's KMeans() algorithm
# fit k-means to blobs
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
# visualize prediction
fig, ax = plt.subplots(figsize=(9,7))
ax.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=dot_size, cmap=cmap)
# get centers for plot
centers = kmeans.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.75)
plt.title('sklearn k-means', fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'sklearn k-means')
# let's test the implementation
centers, labels = find_clusters(X, k)
fig, ax = plt.subplots(figsize=(9,7))
ax.scatter(X[:, 0], X[:, 1], c=labels, s=dot_size, cmap=cmap)
plt.title('find_clusters() k-means func', fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'find_clusters() k-means func')
To build your intuition for this process, play with the following interactive code from Jake VanderPlas in a Jupyter (IPython) notebook¶
# code sourced from:
# http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/06.00-Figure-Code.ipynb#Covariance-Type
from ipywidgets import interact

def plot_kmeans_interactive(min_clusters=1, max_clusters=6):
    X, y = make_blobs(n_samples=300, centers=4,
                      random_state=0, cluster_std=0.60)

    def plot_points(X, labels, n_clusters):
        plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis',
                    vmin=0, vmax=n_clusters - 1)

    def plot_centers(centers):
        plt.scatter(centers[:, 0], centers[:, 1], marker='o',
                    c=np.arange(centers.shape[0]),
                    s=200, cmap='viridis')
        plt.scatter(centers[:, 0], centers[:, 1], marker='o',
                    c='black', s=50)

    def _kmeans_step(frame=0, n_clusters=4):
        rng = np.random.RandomState(2)
        labels = np.zeros(X.shape[0])
        centers = rng.randn(n_clusters, 2)
        nsteps = frame // 3
        for i in range(nsteps + 1):
            old_centers = centers
            if i < nsteps or frame % 3 > 0:
                labels = pairwise_distances_argmin(X, centers)
            if i < nsteps or frame % 3 > 1:
                centers = np.array([X[labels == j].mean(0)
                                    for j in range(n_clusters)])
                nans = np.isnan(centers)
                centers[nans] = old_centers[nans]
        # plot the data and cluster centers
        plot_points(X, labels, n_clusters)
        plot_centers(old_centers)
        # plot new centers if third frame
        if frame % 3 == 2:
            for i in range(n_clusters):
                plt.annotate('', centers[i], old_centers[i],
                             arrowprops=dict(arrowstyle='->', linewidth=1))
            plot_centers(centers)
        plt.xlim(-4, 4)
        plt.ylim(-2, 10)
        if frame % 3 == 1:
            plt.text(3.8, 9.5, "1. Reassign points to nearest centroid",
                     ha='right', va='top', size=14)
        elif frame % 3 == 2:
            plt.text(3.8, 9.5, "2. Update centroids to cluster means",
                     ha='right', va='top', size=14)

    return interact(_kmeans_step, frame=[0, 10, 20, 30, 40, 50, 300],
                    n_clusters=np.arange(min_clusters, max_clusters + 1))
plot_kmeans_interactive()
<function __main__.plot_kmeans_interactive.<locals>._kmeans_step(frame=0, n_clusters=4)>
Now we are ready to explore some of the nuances/issues of implementing K-Means as an expectation maximization algorithm¶
the globally optimal result is not guaranteed¶
- EM is guaranteed to improve the result in each iteration, but there is no guarantee that it will find the global optimum. See the following example, where we initialize the algorithm with a different seed.
practical solution:¶
- Run the algorithm w/ multiple random initializations
- This is done by default in sklearn
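To see what the multiple-restart default buys you, one can compare the final inertia (within-cluster sum of squares) of single random starts against a best-of-ten run — a minimal sketch on blobs like the ones above (parameter choices are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=0)

# Final inertia of ten independent single random starts
single_runs = [KMeans(n_clusters=4, init='random', n_init=1,
                      random_state=seed).fit(X).inertia_ for seed in range(10)]

# sklearn's default behaviour: run several inits and keep the best result
best = KMeans(n_clusters=4, init='random', n_init=10, random_state=0).fit(X).inertia_
print(min(single_runs), best)
```

With well-separated blobs most restarts reach the same optimum, but on harder data the spread across single runs can be substantial.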
centers, labels = find_clusters(X, k, rseed=11)
fig, ax = plt.subplots(figsize=(9,7))
ax.set_title('sub-optimal clustering', fontsize=18, fontweight='demi')
ax.scatter(X[:, 0], X[:, 1], c=labels, s=dot_size, cmap=cmap)
<matplotlib.collections.PathCollection at 0x13d3c1efd60>
The number of means (clusters) has to be selected beforehand¶
- k-means cannot learn the optimal number of clusters from the data. If we ask for six clusters it will find six clusters, which may or may not be meaningful.
practical solution:¶
- use a more complex clustering algorithm like Gaussian Mixture Models, or one that can choose a suitable number of clusters (DBSCAN, mean-shift, affinity propagation)
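One common heuristic for picking the number of clusters with k-means itself is the silhouette score, which peaks when clusters are compact and well separated — a sketch on blobs like the ones above (the scoring loop and scanned range are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=0)

# Score k-means partitions for several candidate cluster counts
scores = {}
for n in range(2, 8):
    labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(X)
    scores[n] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k)
```

On these well-separated blobs the score peaks at the true number of centers; on real data the peak is usually less clear-cut.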
from sklearn.cluster import KMeans
labels6 = KMeans(6, random_state=random_state).fit_predict(X)
fig, ax = plt.subplots(figsize=(9,7))
ax.set_title('too many clusters', fontsize=18, fontweight='demi')
ax.scatter(X[:, 0], X[:, 1], c=labels6, s=dot_size, cmap=cmap)
<matplotlib.collections.PathCollection at 0x13d3c876ca0>
from sklearn.datasets import make_moons
X_mn, y_mn = make_moons(500, noise=.07, random_state=random_state)
labelsM = KMeans(2, random_state=random_state).fit_predict(X_mn)
fig, ax = plt.subplots(figsize=(9,7))
ax.set_title('linear separation not possible', fontsize=18, fontweight='demi')
ax.scatter(X_mn[:, 0], X_mn[:, 1], c=labelsM, s=dot_size, cmap=cmap)
<matplotlib.collections.PathCollection at 0x13d3ca918b0>
from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
assign_labels='kmeans')
labelsS = model.fit_predict(X_mn)
fig, ax = plt.subplots(figsize=(9,7))
ax.set_title('kernel transform to higher dimension\nlinear separation is possible', fontsize=18, fontweight='demi')
plt.scatter(X_mn[:, 0], X_mn[:, 1], c=labelsS, s=dot_size, cmap=cmap)
<matplotlib.collections.PathCollection at 0x13d3bee7e80>
K-Means is known as a hard clustering algorithm because clusters are not allowed to overlap.¶
*"One way to think about the k-means model is that it places a circle (or, in higher dimensions, a hyper-sphere) at the center of each cluster, with a radius defined by the most distant point in the cluster. This radius acts as a hard cutoff for cluster assignment within the training set: any point outside this circle is not considered a member of the cluster."* -- Jake VanderPlas, Python Data Science Handbook [1]
# k-means weaknesses that mixture models address directly
# code sourced from:
# http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.12-Gaussian-Mixtures.ipynb
from scipy.spatial.distance import cdist
def plot_kmeans(kmeans, X, n_clusters=k, rseed=2, ax=None):
    labels = kmeans.fit_predict(X)
    # plot the input data
    fig, ax = plt.subplots(figsize=(9, 7))
    ax.axis('equal')
    ax.scatter(X[:, 0], X[:, 1],
               c=labels, s=dot_size, cmap=cmap, zorder=2)
    # plot the representation of the k-means model:
    # one circle per cluster, with radius set by the most distant member
    centers = kmeans.cluster_centers_
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', edgecolor='slategrey',
                                lw=4, alpha=0.5, zorder=1))
X3, y_true = make_blobs(n_samples = 400,
centers = k,
cluster_std = .6,
random_state = random_state)
X3 = X3[:, ::-1] # better plotting
kmeans = KMeans(n_clusters=k, random_state=random_state)
plot_kmeans(kmeans, X3)
plt.title('Clusters are hard circular boundaries', fontsize=18, fontweight='demi')
plt.savefig('Kmeans_circular.png',dpi=600)
A resulting issue of K-Means' circular boundaries is that it has no way to account for oblong or elliptical clusters.¶
rng = np.random.RandomState(13)
X3_stretched = np.dot(X3, rng.randn(2, 2))
kmeans = KMeans(n_clusters=k, random_state=random_state)
plot_kmeans(kmeans, X3_stretched)
plt.title('Clusters cannot adjust to elliptical data structures',
fontsize=18, fontweight='demi')
plt.savefig('Kmeans_elliptical.png',dpi=600)
There are two ways we can extend K-Means¶
1. measure uncertainty in cluster assignments by comparing distances to all cluster centers
2. allow for flexibility in the shape of the cluster boundaries by using ellipses
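Point 1 is exactly what a Gaussian mixture exposes through predict_proba: instead of a single hard label, each point gets a probability of membership in every cluster. A minimal sketch:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
probs = gmm.predict_proba(X)   # shape (400, 4): soft cluster memberships
hard = gmm.predict(X)          # the corresponding hard assignment
print(probs.shape)
```

Each row of probs sums to one; points deep inside a blob have one probability near 1, while points between blobs split their membership.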
Recall our previous height example, and let's assume that each cluster is a Gaussian distribution!¶
Gaussian distributions give flexibility to the clustering, and the same basic two step E-M algorithm used in K-Means is applied here as well.¶
1. Randomly initialize the location and shape of each cluster.
2. Repeat until converged:
   - E-step: for each point, find weights encoding the probability of membership in each cluster.
   - M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights.
The result of this process is that we end up with a smooth Gaussian cluster better fitted to the shape of the data, instead of a rigid inflexible circle.¶
Note that because we still are using the E-M algorithm there is no guarantee of a globally optimal result. We can visualize the results of the model.¶
# code sourced from:
# http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.12-Gaussian-Mixtures.ipynb
from matplotlib.patches import Ellipse
def draw_ellipse(position, covariance, ax=None, **kwargs):
    """Draw an ellipse with a given position and covariance"""
    ax = ax or plt.gca()
    # Convert covariance to principal axes
    if covariance.shape == (2, 2):
        U, s, Vt = np.linalg.svd(covariance)
        angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
        width, height = 2 * np.sqrt(s)
    else:
        angle = 0
        width, height = 2 * np.sqrt(covariance)
    # Draw the ellipse at 1, 2 and 3 standard deviations
    for nsig in range(1, 4):
        ax.add_patch(Ellipse(position, nsig * width, nsig * height,
                             angle=angle, **kwargs))

def plot_gmm(gmm, X, label=True, ax=None):
    if ax is None:
        fig, ax = plt.subplots(figsize=(9, 7))
    labels = gmm.fit(X).predict(X)
    if label:
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=dot_size, cmap=cmap, zorder=2)
    else:
        ax.scatter(X[:, 0], X[:, 1], s=dot_size, zorder=2)
    ax.axis('equal')
    w_factor = 0.2 / gmm.weights_.max()
    for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
        draw_ellipse(pos, covar, ax=ax, alpha=w * w_factor)
gmm = mix.GaussianMixture(n_components=k, random_state=random_state)
plot_gmm(gmm, X3)
plt.title('GMM isotropic clusters', fontsize=18, fontweight='demi')
plt.savefig('GMM_circular.png',dpi=600)
# let's test on the stretched data set
gmm = mix.GaussianMixture(n_components=k, random_state=random_state+1)
plot_gmm(gmm, X3_stretched)
plt.title('GMM elliptical clusters', fontsize=18, fontweight='demi')
plt.savefig('GMM_elliptical.png',dpi=600)
Notice how much better the model is able to fit the clusters when we assume each cluster is a Gaussian distribution, instead of a rigid circle whose radius is defined by the most distant point.¶
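How flexible each cluster's shape can be is controlled by `GaussianMixture`'s `covariance_type` parameter: 'spherical' forces circles, 'diag' allows axis-aligned ellipses, 'tied' uses one shared ellipse for all clusters, and 'full' (the type used above) allows an arbitrary ellipse per cluster. One quick way to see the difference is to inspect the shape of the fitted `covariances_` array; the synthetic blob below is made up for this sketch and is not the notebook's `X3`:

```python
from sklearn.mixture import GaussianMixture
import numpy as np

# a single correlated 2-D blob, just to have something to fit
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) @ np.array([[2.0, 0.5], [0.0, 1.0]])

# covariances_ has a different shape for each covariance_type
shapes = {}
for cov_type in ['spherical', 'diag', 'tied', 'full']:
    gm = GaussianMixture(n_components=3, covariance_type=cov_type,
                         random_state=0).fit(X)
    shapes[cov_type] = gm.covariances_.shape
```

With 3 components and 2 features: 'spherical' stores one variance per component, 'diag' one variance per component per feature, 'tied' a single shared 2x2 matrix, and 'full' a 2x2 matrix per component.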
Gaussian Mixture Models as a tool for Density Estimation¶
The technical term for this type of model is:¶
generative probabilistic model
Why, you ask?¶
Because this model is really about characterizing the distribution of the entire dataset and not necessarily clustering. The power of these types of models is that they allow us to generate new samples that mimic the original underlying data!
gmm2 = mix.GaussianMixture(n_components=2, covariance_type='full',
random_state=random_state)
plot_gmm(gmm2, X_mn)
If we try to cluster this data set, we run into the same issue as before.
Instead let's ignore individual clusters and model the whole distribution of data as a collection of many Gaussians.
gmm16 = mix.GaussianMixture(n_components=16, covariance_type='full',
random_state=random_state)
plot_gmm(gmm16, X_mn, label=False)
plt.title('Collective Gaussian clusters',
fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'Collective Gaussian clusters')
It looks like the collection of Gaussians has fit the data set reasonably well. Now let's see whether the model has actually learned the structure of the data, such that we can draw entirely new samples that look like the original.
Xnew, ynew = gmm16.sample(500)
fig, ax = plt.subplots(figsize=(9,7))
ax.scatter(Xnew[:, 0], Xnew[:, 1]);
ax.set_title('New samples drawn from fitted model',
fontsize=18, fontweight='demi')
Text(0.5, 1.0, 'New samples drawn from fitted model')
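Since the fitted mixture is itself a probability density, we can also query the log-likelihood of arbitrary points with `score_samples`. A minimal sketch on a synthetic blob (not the `X_mn` data used above):

```python
from sklearn.mixture import GaussianMixture
import numpy as np

# fit a small mixture to a standard-normal blob
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 2))
gm = GaussianMixture(n_components=3, random_state=1).fit(X)

# per-sample log-density under the fitted mixture:
# high near the bulk of the data, much lower far away from it
log_dense = gm.score_samples(np.array([[0.0, 0.0]]))[0]
log_sparse = gm.score_samples(np.array([[10.0, 10.0]]))[0]
```

This per-sample likelihood is exactly what the AIC/BIC model-selection criteria in the next cell are built on.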
Generative models allow for multiple methods to determine the optimal number of components. Because the fitted model is a probability distribution, we can evaluate the likelihood of held-out data with cross-validation, or use information criteria such as the AIC or BIC.
Sklearn makes this easy.
n_components = np.arange(1, 21)
models = [mix.GaussianMixture(n, covariance_type='full',
random_state=random_state).fit(X_mn)
for n in n_components]
fig, ax = plt.subplots(figsize=(9,7))
ax.plot(n_components, [m.bic(X_mn) for m in models], label='BIC')
ax.plot(n_components, [m.aic(X_mn) for m in models], label='AIC')
ax.axvline(n_components[np.argmin([m.bic(X_mn) for m in models])], color='blue')
ax.axvline(n_components[np.argmin([m.aic(X_mn) for m in models])], color='green')
plt.legend(loc='best')
plt.xlabel('n_components')
Text(0.5, 0, 'n_components')